Example Usage
This notebook provides a quick start guide to understanding and exploring the Bilby dataset.
Overview
The dataset contains full-text government news data, enriched with named entities (people, organizations, companies, etc.) and policy life cycle (PLC) labels identified using our machine learning models.
Each article includes:
- Article metadata (URL, publication date, newspaper, etc.)
- Article content (title and body in English and original language)
- Extracted entities with confidence scores and character offsets
- Policy life cycle (PLC) classification with confidence scores
Table of Contents
- Load and Inspect Data
- Dataset Overview
- Temporal Analysis
- Source Analysis
- Working with Extracted Entities
- Entity Statistics
- Sample Entity Exploration for One Article
- Policy Label Classification (PLC)
1. Load and Inspect Data
First, let's load the required libraries and read the dataset.
# Import required libraries
import pandas as pd
import json
from collections import Counter
from datetime import datetime
import warnings
import glob

warnings.filterwarnings("ignore")

print("✅ Libraries loaded successfully")
✅ Libraries loaded successfully
print("📊 LOADING DATASET") print("=" * 70) # REPLACE WITH YOUR DATASET PATH BELOW data_path = "./data" parquet_files = glob.glob(f"{data_path}/*.parquet") print(f"Found {len(parquet_files)} parquet files") print(f"Files: {[f.split('/')[-1] for f in parquet_files[:5]]}...") # Read and concatenate all files bilby_df = [] for f in parquet_files: try: df = pd.read_parquet(f) bilby_df.append(df) except Exception as e: print(f"❌ Error reading {f}: {e}") bilby_df = pd.concat(bilby_df, ignore_index=True) print(f"\n✅ Dataset loaded: {len(bilby_df):,} rows × {len(bilby_df.columns)} columns") print(f" Total records: {len(bilby_df):,}") print(f" Total columns: {len(bilby_df.columns)}") print(f"Memory usage: {bilby_df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
📊 LOADING DATASET
======================================================================
Found 2 parquet files
Files: ['daily_2025-06-03.parquet', 'daily_2025-06-04.parquet']...

✅ Dataset loaded: 10,088 rows × 27 columns
   Total records: 10,088
   Total columns: 27
Memory usage: 168.8 MB
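As a side note, if the pyarrow engine is available, pandas can also read a directory of Parquet files in a single call. This one-liner is a sketch of an alternative to the loop above; it assumes all files share a consistent schema and gives you less control over per-file error handling.

# Alternative: read the whole directory at once (assumes pyarrow is installed
# and all files share the same schema; the loop above is more robust to
# individual corrupt files)
bilby_df = pd.read_parquet(data_path)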
2. Dataset Overview
Let's examine the structure and content of our dataset.
# Display basic information
print("📊 Dataset Shape:")
print(f"   Rows: {bilby_df.shape[0]:,}")
print(f"   Columns: {bilby_df.shape[1]}")
print()
print("📋 Column Names:")
for i, col in enumerate(bilby_df.columns, 1):
    print(f"   {i:2d}. {col}")
📊 Dataset Shape:
   Rows: 10,088
   Columns: 27

📋 Column Names:
    1. uuid
    2. branch_id
    3. published_at
    4. news_line
    5. newspaper
    6. author
    7. article_url
    8. title
    9. body
   10. title_en
   11. subhead_en
   12. body_en
   13. summary
   14. translated_summary
   15. inserted_at
   16. country
   17. language
   18. subhead
   19. first_date
   20. is_original
   21. copies
   22. copy_sources
   23. copy_urls
   24. extracted_entities_en
   25. extracted_entities_en_count
   26. plc_label
   27. plc_label_scores
# Display column data types
print("🔍 Column Data Types:")
print()
bilby_df.dtypes
🔍 Column Data Types:
uuid                                        object
branch_id                                    Int64
published_at                   datetime64[us, UTC]
news_line                                   object
newspaper                                   object
author                                      object
article_url                                 object
title                                       object
body                                        object
title_en                                    object
subhead_en                                  object
body_en                                     object
summary                                     object
translated_summary                          object
inserted_at                    datetime64[us, UTC]
country                                     object
language                                    object
subhead                                     object
first_date                     datetime64[us, UTC]
is_original                                  Int64
copies                                       Int64
copy_sources                                object
copy_urls                                   object
extracted_entities_en                       object
extracted_entities_en_count                float64
plc_label                                   object
plc_label_scores                            object
dtype: object
pd.set_option("display.max_columns", None)

# Display first few rows
print("👀 First 5 Rows of Data:")
print()
bilby_df.head(5)
👀 First 5 Rows of Data:
uuid branch_id \ 0 5d3fb794-8631-ade6-8337-238377c0b637ee557291-8... <NA> 1 ecd511a5-fbce-8027-6149-75e9837d24716130ac82-a... <NA> 2 7b644162-8959-314a-8fad-bc0e03ba36868d6c0888-7... <NA> 3 37b28e26-885e-9203-88f6-199ee9911cfe9ab6ca56-1... <NA> 4 f61ae2f1-a18e-32d7-bfae-2e1348bfdb84836476b7-3... <NA> published_at news_line newspaper author \ 0 2025-06-03 16:00:00+00:00 official_line Sichuan Daily 1 2025-06-03 16:00:00+00:00 official_line Sichuan Daily 2 2025-06-03 16:00:00+00:00 official_line Sichuan Daily 3 2025-06-03 16:00:00+00:00 official_line Sichuan Daily 4 2025-06-03 16:00:00+00:00 official_line Sichuan Daily article_url title \ 0 https://epaper.scdaily.cn/shtml/scrb/20250604/... 新“铝”程上,广元如何提速? 1 https://epaper.scdaily.cn/shtml/scrb/20250604/... 游客服务保障更高效 2 https://epaper.scdaily.cn/shtml/scrb/20250604/... 民宿长成村落,避暑更有滋味 3 https://epaper.scdaily.cn/shtml/scrb/20250604/... 区域人才,未来可“云共享” 4 https://epaper.scdaily.cn/shtml/scrb/20250604/... 共计发放奖励260余万元 body \ 0 □四川日报全媒体记者 张敏\n近日,广元市铝基新材料产业投资推介会暨经济合作项目签约仪式举行... 1 本报讯\n 5月27日,剑门关景区北门,几名身着红马甲的党员志愿者站在入口位置,看到一位游客... 2 □四川日报全媒体记者 张敏 文/图\n5月30日,2025四川(曾家山)公路自行车赛暨“骑遍... 3 本报讯\n 5月29日,记者从广元市利州区委组织部获悉,截至目前,今年利州区11家重点工业、... 4 本报讯\n 近日,广元市民马文华发现利州区南河街道接官亭社区大一污水处理厂外的马路有塌陷,影... title_en \ 0 How can Guangyuan speed up on the new "aluminu... 1 Tourist services are more efficient and guaran... 2 Homestays have become villages, and summer vac... 3 Regional talents, can be "cloud shared" in the... 4 A total of more than 2.6 million yuan in rewar... subhead_en \ 0 A total of 31 aluminum-based new material indu... 1 Thousands of party members and cadres in Jiang... 2 Zengjiashan Mountain Resort transforms into a ... 3 11 key enterprises in Lizhou District recruit ... 4 More than 100,000 people have joined the "Guan... body_en \ 0 Sichuan Daily All-Media Reporter Zhang Min\nRe... 1 News from this newspaper\nOn May 27th, at the ... 2 Sichuan Daily All-Media Reporter Zhang Min Tex... 3 News from this newspaper\nOn May 29, it was le... 4 News from this newspaper\nRecently, a resident... summary \ 0 广元市铝基新材料产业投资推介会签约31个项目,总金额超210亿元。广元铝产业发展迅速,全产业... 1 剑阁县党员干部志愿者在景区提供服务,缓解游客激增压力,为60万人次游客提供帮助。志愿者接受培... 2 2025四川(曾家山)公路自行车赛在曾家山举行,展示了曾家山的生态美景。曾家山发展民宿,提升... 3 广元市利州区与陕西省宁强县、甘肃省文县联合举办招聘会,吸纳125人到企业就业。未来还有100... 4 广元市民通过“广元安全隐患随手拍”系统成功反映道路塌陷问题,得到1000元举报奖励。该系统运... translated_summary \ 0 Guangyuan City signed contracts for 31 project... 1 Party members, cadres, and volunteers in Jiang... 2 The 2025 Sichuan (Zengjiashan) road cycling ra... 3 Lizhou District of Guangyuan City, in conjunct... 4 Guangyuan citizens successfully reported road ... inserted_at country language subhead \ 0 2025-06-03 23:41:31+00:00 China Chinese 一次签约31个铝基新材料产业相关项目,金额超210亿元 1 2025-06-03 23:41:31+00:00 China Chinese 剑阁县上千名党员干部“驻点下沉” 2 2025-06-03 23:41:31+00:00 China Chinese 曾家山朝全景避暑康养旅游地转型 3 2025-06-03 23:41:31+00:00 China Chinese 利州区11家重点企业跨省“招贤” 4 2025-06-03 23:41:31+00:00 China Chinese 10万余人加入“广元安全隐患随手拍” first_date is_original copies copy_sources copy_urls \ 0 0010-01-19 15:54:17+00:00 1 1 [] [] 1 0010-01-19 15:54:17+00:00 1 1 [] [] 2 0010-01-19 15:54:17+00:00 1 1 [] [] 3 0010-01-19 15:54:17+00:00 1 1 [] [] 4 0010-01-19 15:54:17+00:00 1 1 [] [] extracted_entities_en \ 0 [{"extracted_entity_text": "Sichuan Daily All-... 1 [{"extracted_entity_text": "Jianmen Pass", "ex... 2 [{"extracted_entity_text": "Sichuan Daily All-... 3 [{"extracted_entity_text": "Organization Depar... 4 [{"extracted_entity_text": "Guangyuan City", "... 
extracted_entities_en_count plc_label plc_label_scores 0 51.0 NOT_POLICY [0.9998, 0.0001, 0.0, 0.0001] 1 4.0 NOT_POLICY [0.9995, 0.0002, 0.0002, 0.0001] 2 67.0 NOT_POLICY [0.9998, 0.0001, 0.0001, 0.0] 3 26.0 NOT_POLICY [0.9928, 0.0008, 0.0061, 0.0003] 4 15.0 NOT_POLICY [0.9996, 0.0002, 0.0001, 0.0001]
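Beyond looking at the first rows, it helps to know how complete each field is before filtering on it. The following is a minimal sketch using only the columns listed above; no additional assumptions about the data.

# Count missing values per column, showing only columns with at least one gap
missing = bilby_df.isna().sum().sort_values(ascending=False)
print("🔎 Missing values per column:")
print(missing[missing > 0])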
3. Temporal Analysis
Understanding the time range in the dataset.
# Convert inserted_at to datetime if needed
bilby_df["inserted_at"] = pd.to_datetime(bilby_df["inserted_at"])

print("📅 Temporal Coverage:")
print()
print(f"   Earliest article: {bilby_df['inserted_at'].min()}")
print(f"   Latest article: {bilby_df['inserted_at'].max()}")
print(
    f"   Date range: {(bilby_df['inserted_at'].max() - bilby_df['inserted_at'].min()).days} days"
)
print()

# Extract date component
bilby_df["date_only"] = bilby_df["inserted_at"].dt.date
print(f"   Unique dates: {bilby_df['date_only'].nunique()}")
📅 Temporal Coverage:

   Earliest article: 2025-06-03 00:01:05+00:00
   Latest article: 2025-06-04 23:17:53+00:00
   Date range: 1 days

   Unique dates: 2
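To see how articles are distributed across those dates, you can count rows per day using the date_only column created above. This is a minimal sketch with no assumptions beyond that column.

# Count articles per calendar day of insertion
daily_counts = bilby_df["date_only"].value_counts().sort_index()
print("🗓️ Articles per day:")
for day, count in daily_counts.items():
    print(f"   {day}: {count:,} articles")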
4. Source Analysis
Analyzing the sources of articles (newspapers and news lines).
# Newspaper analysis
print("📰 Newspaper Analysis:")
print()

if "newspaper" in bilby_df.columns:
    newspaper_counts = bilby_df["newspaper"].value_counts()

    print(f"   Total unique newspapers: {bilby_df['newspaper'].nunique()}")
    print()
    print("   Top 10 newspapers by article count:")
    for newspaper, count in newspaper_counts.head(10).items():
        percentage = (count / len(bilby_df)) * 100
        print(f"     {newspaper}: {count:,} articles ({percentage:.1f}%)")
else:
    print("   ⚠️ 'newspaper' column not found in dataset")
📰 Newspaper Analysis:

   Total unique newspapers: 399

   Top 10 newspapers by article count:
     Securities Daily: 856 articles (8.5%)
     ChinaNationalPharmaceuticalPackagingAssociation: 540 articles (5.4%)
     Hikvision: 349 articles (3.5%)
     Wen Wei Po: 260 articles (2.6%)
     Ta Kung Pao: 230 articles (2.3%)
     People's Daily: 186 articles (1.8%)
     Tianjin daily: 173 articles (1.7%)
     Guizhou Daily: 169 articles (1.7%)
     Xinhua Daily: 159 articles (1.6%)
     Procuratorate daily: 145 articles (1.4%)
# News line analysis
print("📡 News Line Analysis:")
print()

if "news_line" in bilby_df.columns:
    news_line_counts = bilby_df["news_line"].value_counts()

    print(f"   Total unique news lines: {bilby_df['news_line'].nunique()}")
    print()
    print("   Top 10 news lines by article count:")
    for news_line, count in news_line_counts.head(10).items():
        percentage = (count / len(bilby_df)) * 100
        print(f"     {news_line}: {count:,} articles ({percentage:.1f}%)")
else:
    print("   ⚠️ 'news_line' column not found in dataset")
📡 News Line Analysis:

   Total unique news lines: 9

   Top 10 news lines by article count:
     official_line: 6,990 articles (69.3%)
     IndustryAssociation: 1,682 articles (16.7%)
     private_enterprise: 588 articles (5.8%)
     ministry: 361 articles (3.6%)
     SOE: 240 articles (2.4%)
     private_line: 149 articles (1.5%)
     party: 45 articles (0.4%)
     stockexchange: 25 articles (0.2%)
     bank: 8 articles (0.1%)
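A quick way to relate the two source fields is to count how many distinct newspapers publish under each news line. This is a minimal sketch using only the newspaper and news_line columns shown above.

# Number of distinct newspapers contributing to each news line
papers_per_line = (
    bilby_df.groupby("news_line")["newspaper"].nunique().sort_values(ascending=False)
)
print("🗞️ Distinct newspapers per news line:")
print(papers_per_line)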
5. Working with Extracted Entities
The extracted_entities_en column contains JSON data with detailed information
about each extracted entity.
Entity JSON Structure
Each entity in the JSON array has the following fields:
- extracted_entity_text: The actual text of the entity
- extracted_entity_type: Type/category (see below for complete list)
- score: Confidence score (0-1)
- start: Starting character position in the text
- end: Ending character position in the text
- occurrence_count: How many times this entity appears
- model: The model used for extraction
- timestamp: When the extraction was performed
- source_document_uuid: Reference to the source article
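For orientation, a single element of that JSON array looks roughly like the record below. The values are illustrative (loosely based on outputs later in this notebook), and the model identifier, timestamp format, and UUID shown here are placeholders rather than guaranteed values.

# Illustrative shape of one entity record (values are placeholders)
example_entity = {
    "extracted_entity_text": "Spring Festival",
    "extracted_entity_type": "Event",
    "score": 0.9858,
    "start": 369,
    "end": 384,
    "occurrence_count": 1,
    "model": "<model-identifier>",
    "timestamp": "<extraction-timestamp>",
    "source_document_uuid": "<article-uuid>",
}
print(example_entity["extracted_entity_text"], example_entity["score"])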
Entity Types Extracted
Our model identifies the following entity types:
- Person: Individual people (e.g., "Xi Jinping", "Elon Musk")
- Company: Commercial organizations and businesses (e.g., "Apple Inc.", "Alibaba Group")
- Government Body: Government agencies and departments (e.g., "Ministry of Finance", "SEC")
- Party Body: Political party organizations (e.g., "Communist Party", "Democratic Party")
- NGO (Non-Governmental Organization): Non-profit and civil society organizations (e.g., "Red Cross", "WWF")
- IGO (Intergovernmental Organization): International governmental organizations (e.g., "United Nations", "World Bank")
- GPE (Geo-Political Entity): Geographic locations with political significance (e.g., "China", "California", "Beijing")
- Currency Mention: References to monetary amounts (e.g., "$1 billion", "100 million yuan")
- Event: Named events and occurrences (e.g., "World Cup", "Spring Festival")
- Initiative: Programs, policies, and strategic initiatives (e.g., "Belt and Road Initiative", "Green New Deal")
- Miscellaneous Organization: Other organizational entities not covered by the above categories
# Example: Parse and display entities from one article
print("🔍 Example: Parsing Extracted Entities")
print()

# Find an article with entities
sample_idx = bilby_df[bilby_df["extracted_entities_en"].notna()].index[0]
sample_entities_str = bilby_df.loc[sample_idx, "extracted_entities_en"]

# Parse JSON
entities = json.loads(sample_entities_str)

print(f"   Article index: {sample_idx}")
print(f"   Total entities: {len(entities)}")
print()
print("   First 3 entities:")
print()

for i, entity in enumerate(entities[:3], 1):
    print(f"   Entity {i}:")
    print(f"     Text: '{entity['extracted_entity_text']}'")
    print(f"     Type: {entity['extracted_entity_type']}")
    print(f"     Score: {entity['score']:.4f}")
    print(f"     Position: [{entity['start']}:{entity['end']}]")
    print()
🔍 Example: Parsing Extracted Entities

   Article index: 0
   Total entities: 11

   First 3 entities:

   Entity 1:
     Text: 'China'
     Type: GPE
     Score: 0.9863
     Position: [11:16]

   Entity 2:
     Text: 'Sun Tzu'
     Type: Person
     Score: 0.9788
     Position: [258:265]

   Entity 3:
     Text: 'China'
     Type: GPE
     Score: 0.9857
     Position: [1270:1275]
# Helper function to parse all entities from an article
def parse_entities(entities_json_str):
    """
    Parse entities from JSON string.

    Args:
        entities_json_str: JSON string containing entity data

    Returns:
        List of entity dictionaries, or empty list if parsing fails
    """
    if pd.isna(entities_json_str):
        return []
    try:
        entities = json.loads(entities_json_str)
        return entities if isinstance(entities, list) else []
    except (json.JSONDecodeError, TypeError):
        return []


# Helper function to extract entities by type
def get_entities_by_type(entities_json_str, entity_type):
    """
    Extract entities of a specific type from an article.

    Args:
        entities_json_str: JSON string containing entity data
        entity_type: Type to filter (e.g., 'Person', 'Company', 'GPE')

    Returns:
        List of entities matching the specified type
    """
    entities = parse_entities(entities_json_str)
    return [e for e in entities if e.get("extracted_entity_type") == entity_type]


print("✅ Helper functions defined:")
print("   - parse_entities(entities_json_str)")
print("   - get_entities_by_type(entities_json_str, entity_type)")
✅ Helper functions defined:
   - parse_entities(entities_json_str)
   - get_entities_by_type(entities_json_str, entity_type)
# Example: Filter dataframe to get only articles mentioning companies
print("📝 Filtering Articles by Entity Type (Company)")
print("=" * 80)
print()


# Create a function to check if an article has company entities
def has_company_entities(entities_str):
    """Check if an article contains any Company entities"""
    companies = get_entities_by_type(entities_str, "Company")
    return len(companies) > 0


# Apply filter (on first 500 for performance)
sample_df = bilby_df.head(500)
company_articles = sample_df[
    sample_df["extracted_entities_en"].apply(has_company_entities)
]

print(f"Total articles analyzed: {len(sample_df):,}")
print(
    f"Articles with companies: {len(company_articles):,} ({len(company_articles) / len(sample_df) * 100:.1f}%)"
)
print()

# Show sample articles
print("Sample articles mentioning companies:")
print()

for idx in company_articles.head(5).index:
    article = bilby_df.loc[idx]
    companies = get_entities_by_type(article["extracted_entities_en"], "Company")
    company_names = [c["extracted_entity_text"] for c in companies[:3]]

    print(f"Article: {article['title_en'][:70]}...")
    print(f"  Companies: {', '.join(company_names)}")
    if len(companies) > 3:
        print(f"  ... and {len(companies) - 3} more")
    print()
📝 Filtering Articles by Entity Type (Company)
================================================================================

Total articles analyzed: 500
Articles with companies: 92 (18.4%)

Sample articles mentioning companies:

Article: Climb the ladder first to break through the enemy's pass....
  Companies:

Article: The story of the vase in the depths of time...
  Companies:

Article: Shizong County strengthens the urban flood control and drainage networ...
  Companies:

Article: The misty rain is beautiful in a different way, welcoming guests with ...
  Companies:

Article: Spring City Science π released...
  Companies:
# Example: Find all unique companies in the dataset
print("🔍 Finding All Unique Companies in Dataset")
print("=" * 80)
print()

all_companies = set()
articles_with_companies = 0

# Analyze first 500 articles for performance
sample_size = min(500, len(bilby_df))
print(f"Analyzing first {sample_size} articles...")
print()

for idx, entities_str in enumerate(
    bilby_df["extracted_entities_en"].dropna().head(sample_size)
):
    companies = get_entities_by_type(entities_str, "Company")
    if companies:
        articles_with_companies += 1
        all_companies.update([c["extracted_entity_text"] for c in companies])

print("📊 Results:")
print(f"   Articles analyzed: {sample_size:,}")
print(
    f"   Articles with companies: {articles_with_companies:,} ({articles_with_companies / sample_size * 100:.1f}%)"
)
print(f"   Unique companies found: {len(all_companies):,}")
print()

# Show sample companies
print("Sample companies (showing first 20):")
print()
for i, company in enumerate(sorted(all_companies)[:20], 1):
    print(f"   {i:2d}. {company}")

if len(all_companies) > 20:
    print(f"   ... and {len(all_companies) - 20} more")
🔍 Finding All Unique Companies in Dataset
================================================================================

Analyzing first 500 articles...

📊 Results:
   Articles analyzed: 500
   Articles with companies: 140 (28.0%)
   Unique companies found: 476

Sample companies (showing first 20):

    1. "Germanwatch,"
    2. "Innovation Investment Consortium"
    3. 48 Group Club
    4. @Visual China
    5. ANTA Sports
    6. AYDO
    7. AbbVie
    8. Accenture
    9. Agence France-Presse
   10. Al-Dawaa Medical Services Company
   11. Alipay
   12. Alumni Seed Fund
   13. Amazon
   14. Anyi
   15. Asia Pacific Aviation
   16. AstraZeneca
   17. Aurobindo Pharma
   18. BGI Genomics
   19. BYD
   20. Baidu
   ... and 456 more
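Beyond unique names, it is often more useful to know which companies are mentioned most often. The sketch below reuses get_entities_by_type on the same 500-article sample; counting one mention per entity record is a simplifying assumption (the occurrence_count field could be used instead).

# Count how often each company name appears across the sample
company_counter = Counter()
for entities_str in bilby_df["extracted_entities_en"].dropna().head(500):
    for company in get_entities_by_type(entities_str, "Company"):
        company_counter[company["extracted_entity_text"]] += 1

print("🏢 Most frequently mentioned companies:")
for name, count in company_counter.most_common(10):
    print(f"   {name}: {count} mentions")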
6. Entity Statistics
Let's analyze entity extraction patterns across the entire dataset.
# Overall entity statistics
print("📊 Entity Extraction Statistics:")
print()

if "extracted_entities_en_count" in bilby_df.columns:
    entity_counts = bilby_df["extracted_entities_en_count"]

    print(f"   Total entities extracted: {entity_counts.sum():,}")
    print(f"   Average per article: {entity_counts.mean():.1f}")
    print(f"   Median per article: {entity_counts.median():.1f}")
    print(f"   Max entities in one article: {entity_counts.max()}")
    print(f"   Articles with 0 entities: {(entity_counts == 0).sum():,}")
    print()

    # Distribution
    print("   Entity count distribution:")
    print(f"     0 entities: {(entity_counts == 0).sum():,} articles")
    print(
        f"     1-10 entities: {((entity_counts >= 1) & (entity_counts <= 10)).sum():,} articles"
    )
    print(
        f"     11-25 entities: {((entity_counts >= 11) & (entity_counts <= 25)).sum():,} articles"
    )
    print(
        f"     26-50 entities: {((entity_counts >= 26) & (entity_counts <= 50)).sum():,} articles"
    )
    print(f"     50+ entities: {(entity_counts > 50).sum():,} articles")
else:
    print("   ⚠️ 'extracted_entities_en_count' column not found")
📊 Entity Extraction Statistics:

   Total entities extracted: 124,991.0
   Average per article: 25.6
   Median per article: 17.0
   Max entities in one article: 375.0
   Articles with 0 entities: 502

   Entity count distribution:
     0 entities: 502 articles
     1-10 entities: 1,136 articles
     11-25 entities: 1,565 articles
     26-50 entities: 1,022 articles
     50+ entities: 663 articles
# Analyze entity types across all articles
print("🏷️ Entity Type Distribution:")
print()
print("   Analyzing entity types across all articles...")

all_entity_types = []

# Sample first 1000 articles for performance
sample_size = min(1000, len(bilby_df))
sample_df = bilby_df.head(sample_size)

for entities_str in sample_df["extracted_entities_en"].dropna():
    entities = parse_entities(entities_str)
    all_entity_types.extend(
        [e.get("extracted_entity_type", "Unknown") for e in entities]
    )

type_distribution = Counter(all_entity_types)
total_entities_analyzed = len(all_entity_types)

print(f"   Analyzed {sample_size:,} articles")
print(f"   Total entities found: {total_entities_analyzed:,}")
print()
print("   Top 10 entity types:")
print()
for entity_type, count in type_distribution.most_common(10):
    percentage = (count / total_entities_analyzed) * 100
    print(f"     {entity_type:30s}: {count:6,} ({percentage:5.1f}%)")
🏷️ Entity Type Distribution:

   Analyzing entity types across all articles...
   Analyzed 1,000 articles
   Total entities found: 19,397

   Top 10 entity types:

     GPE                           :  7,839 ( 40.4%)
     Person                        :  3,680 ( 19.0%)
     Government Body               :  2,000 ( 10.3%)
     Event                         :  1,762 (  9.1%)
     Initiative                    :    890 (  4.6%)
     Party Body                    :    853 (  4.4%)
     Company                       :    847 (  4.4%)
     Currency Mention              :    781 (  4.0%)
     Miscellaneous Organization    :    603 (  3.1%)
     IGO                           :     83 (  0.4%)
# Confidence score distribution
print("🎯 Confidence Score Analysis:")
print()

all_scores = []

for entities_str in sample_df["extracted_entities_en"].dropna():
    entities = parse_entities(entities_str)
    all_scores.extend([e.get("score", 0) for e in entities])

scores_series = pd.Series(all_scores)

print(f"   Total scores analyzed: {len(all_scores):,}")
print()
print(f"   Average score: {scores_series.mean():.4f}")
print(f"   Median score: {scores_series.median():.4f}")
print(f"   Min score: {scores_series.min():.4f}")
print(f"   Max score: {scores_series.max():.4f}")
print()
print("   Score distribution:")
print(
    f"     0.90 - 1.00: {((scores_series >= 0.90) & (scores_series <= 1.00)).sum():,} entities ({((scores_series >= 0.90) & (scores_series <= 1.00)).sum() / len(all_scores) * 100:.1f}%)"
)
print(
    f"     0.80 - 0.90: {((scores_series >= 0.80) & (scores_series < 0.90)).sum():,} entities ({((scores_series >= 0.80) & (scores_series < 0.90)).sum() / len(all_scores) * 100:.1f}%)"
)
print(
    f"     0.70 - 0.80: {((scores_series >= 0.70) & (scores_series < 0.80)).sum():,} entities ({((scores_series >= 0.70) & (scores_series < 0.80)).sum() / len(all_scores) * 100:.1f}%)"
)
print(
    f"     0.60 - 0.70: {((scores_series >= 0.60) & (scores_series < 0.70)).sum():,} entities ({((scores_series >= 0.60) & (scores_series < 0.70)).sum() / len(all_scores) * 100:.1f}%)"
)
print(
    f"     < 0.60: {(scores_series < 0.60).sum():,} entities ({(scores_series < 0.60).sum() / len(all_scores) * 100:.1f}%)"
)
🎯 Confidence Score Analysis:

   Total scores analyzed: 19,397

   Average score: 0.9399
   Median score: 0.9930
   Min score: 0.2041
   Max score: 0.9991

   Score distribution:
     0.90 - 1.00: 16,265 entities (83.9%)
     0.80 - 0.90: 1,000 entities (5.2%)
     0.70 - 0.80: 692 entities (3.6%)
     0.60 - 0.70: 550 entities (2.8%)
     < 0.60: 890 entities (4.6%)
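Given that roughly 84% of scores fall at or above 0.90, a common preprocessing step is to keep only high-confidence entities. The sketch below builds on parse_entities; the helper name and the 0.90 threshold are illustrative choices, not part of the dataset's API.

# Keep only entities whose confidence score meets a chosen threshold
def parse_entities_min_score(entities_json_str, min_score=0.90):
    """Return entities with score >= min_score (threshold is illustrative)."""
    return [
        e for e in parse_entities(entities_json_str) if e.get("score", 0) >= min_score
    ]


high_conf = parse_entities_min_score(bilby_df["extracted_entities_en"].dropna().iloc[0])
print(f"High-confidence entities in first article: {len(high_conf)}")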
7. Sample Entity Exploration for One Article
Let's look at a complete example showing how entities relate to the article text.
# Find an article with a good number of entities
print("📖 Sample Article with Extracted Entities")
print("=" * 80)

# Find article with 15-30 entities for a good example
if "extracted_entities_en_count" in bilby_df.columns:
    sample_articles = bilby_df[
        (bilby_df["extracted_entities_en_count"] >= 15)
        & (bilby_df["extracted_entities_en_count"] <= 30)
    ]
    if len(sample_articles) > 0:
        sample_idx = sample_articles.index[0]
    else:
        sample_idx = bilby_df[bilby_df["extracted_entities_en"].notna()].index[0]
else:
    sample_idx = bilby_df[bilby_df["extracted_entities_en"].notna()].index[0]

sample_article = bilby_df.loc[sample_idx]

print()
print(f"Article Title: {sample_article['title_en'][:100]}...")
print()
print(f"Published: {sample_article['inserted_at']}")
if "newspaper" in sample_article:
    print(f"Source: {sample_article['newspaper']}")
print()
print("Body (first 500 characters):")
print(sample_article["body_en"][:500] + "...")
print()
📖 Sample Article with Extracted Entities
================================================================================

Article Title: True love crosses mountains and seas, blessing warms the border....

Published: 2025-01-22 23:29:38+00:00
Source: China National Defense News

Body (first 500 characters):
This report is from Zhang Feiran and Xing Dong: In mid-January, the border and coastal defense forces stationed in Xinjiang, Tibet, and other places gradually received New Year's gift packages from Lanzhou City, Gansu Province. Various delicious foods and heartfelt New Year greeting cards made the officers and soldiers of a border defense company feel warm. "As the Spring Festival approaches, the soldiers are stationed on the snowy plateau and border islands, guarding the country's border and p...
# Display entities from the sample article
entities = parse_entities(sample_article["extracted_entities_en"])

print(f"🏷️ Extracted Entities: {len(entities)}")
print("=" * 80)
print()

# Group by type
entities_by_type = {}
for entity in entities:
    entity_type = entity.get("extracted_entity_type", "Unknown")
    if entity_type not in entities_by_type:
        entities_by_type[entity_type] = []
    entities_by_type[entity_type].append(entity)

# Display by type
for entity_type, type_entities in sorted(entities_by_type.items()):
    print(f"\n{entity_type} ({len(type_entities)} entities):")
    print("-" * 60)

    for i, entity in enumerate(type_entities[:5], 1):  # Show max 5 per type
        text = entity["extracted_entity_text"]
        score = entity["score"]
        start = entity["start"]
        end = entity["end"]

        # Verify the entity matches the text at the specified position
        body_text = sample_article["body_en"]
        extracted_text = (
            body_text[start:end]
            if start < len(body_text) and end <= len(body_text)
            else "[OUT OF RANGE]"
        )
        match_status = "✓" if extracted_text == text else "✗"

        print(f"  [{i}] '{text}'")
        print(
            f"      Score: {score:.4f} | Position: [{start}:{end}] | Match: {match_status}"
        )

    if len(type_entities) > 5:
        print(f"  ... and {len(type_entities) - 5} more")
🏷️ Extracted Entities: 17
================================================================================

Event (1 entities):
------------------------------------------------------------
  [1] 'Spring Festival'
      Score: 0.9858 | Position: [369:384] | Match: ✓

GPE (10 entities):
------------------------------------------------------------
  [1] 'Xinjiang'
      Score: 0.9969 | Position: [115:123] | Match: ✓
  [2] 'Tibet'
      Score: 0.9960 | Position: [125:130] | Match: ✓
  [3] 'Lanzhou City'
      Score: 0.9972 | Position: [198:210] | Match: ✓
  [4] 'Gansu Province'
      Score: 0.9972 | Position: [212:226] | Match: ✓
  [5] 'Lanzhou'
      Score: 0.9981 | Position: [957:964] | Match: ✓
  ... and 5 more

Government Body (2 entities):
------------------------------------------------------------
  [1] 'People's Liberation Army'
      Score: 0.8645 | Position: [2111:2135] | Match: ✓
  [2] 'City military-civilian support office'
      Score: 0.6770 | Position: [2673:2710] | Match: ✓

Person (4 entities):
------------------------------------------------------------
  [1] 'Zhang Feiran'
      Score: 0.9986 | Position: [20:32] | Match: ✓
  [2] 'Xing Dong'
      Score: 0.9978 | Position: [37:46] | Match: ✓
  [3] 'Luo Na'
      Score: 0.9983 | Position: [640:646] | Match: ✓
  [4] 'Li Haoran'
      Score: 0.9987 | Position: [1774:1783] | Match: ✓
8. Policy Label Classification (PLC)
The dataset includes policy label classifications that categorize articles based on their content type. Each article has been classified by a machine learning model into one of four categories:
- NOT_POLICY: Articles that do not discuss policy matters
- INFORMING: Articles that inform about existing policies, regulations, or government actions
- DECIDING: Articles discussing future policy plans or proposals
- IMPLEMENTING: Articles about the execution or implementation of policies
The plc_label column contains the predicted label (the category with the
highest confidence), and plc_label_scores contains the confidence scores for
all four categories as a list:
[NOT_POLICY, INFORMING, DECIDING, IMPLEMENTING].
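To work with the scores programmatically, it helps to pair them with their label names. The sketch below assumes the ordering stated above and treats plc_label_scores as either a list/array or a JSON-encoded string; the helper name is illustrative.

# Pair each PLC score with its label name, assuming the documented order
PLC_LABELS = ["NOT_POLICY", "INFORMING", "DECIDING", "IMPLEMENTING"]


def plc_scores_as_dict(scores):
    """Return {label: score}; accepts a list/array or a JSON-encoded string."""
    if isinstance(scores, str):
        scores = json.loads(scores)
    return dict(zip(PLC_LABELS, [float(s) for s in scores]))


first_row = bilby_df.iloc[0]
print(first_row["plc_label"], plc_scores_as_dict(first_row["plc_label_scores"]))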
# PLC Label Distribution
print("🏷️ Policy Label Classification (PLC) Distribution:")
print()

if "plc_label" in bilby_df.columns:
    plc_counts = bilby_df["plc_label"].value_counts()

    print(f"   Total articles with PLC labels: {bilby_df['plc_label'].notna().sum():,}")
    print()
    print("   Label distribution:")
    for label, count in plc_counts.items():
        percentage = (count / len(bilby_df)) * 100
        print(f"     {label:20s}: {count:6,} ({percentage:5.1f}%)")
else:
    print("   ⚠️ 'plc_label' column not found")
🏷️ Policy Label Classification (PLC) Distribution:

   Total articles with PLC labels: 92,504

   Label distribution:
     NOT_POLICY          : 79,663 ( 86.1%)
     INFORMING           :  8,961 (  9.7%)
     DECIDING            :  3,478 (  3.8%)
     IMPLEMENTING        :    402 (  0.4%)
# Explore articles in the INFORMING category
print("🔍 Exploring INFORMING Category Articles:")
print("=" * 80 + "\n")


# Helper function to parse PLC scores
def parse_plc_scores(scores):
    """Parse PLC scores - handles both JSON strings and numpy arrays"""
    try:
        if isinstance(scores, str):
            scores = json.loads(scores)
        return [float(s) for s in scores]
    except (json.JSONDecodeError, TypeError, ValueError):
        return None


# Get INFORMING articles
plc_filter = "INFORMING"
plc_filtered_articles = bilby_df[bilby_df["plc_label"] == plc_filter]
print(f"Total {plc_filter} articles: {len(plc_filtered_articles):,}\n")

# Show 3 sample articles with their scores
print(f"Sample {plc_filter} articles:\n")

for idx, (_, article) in enumerate(plc_filtered_articles.head(3).iterrows(), 1):
    scores = parse_plc_scores(article["plc_label_scores"])

    print(f"[{idx}] {article['title_en'][:80]}...")
    print(f"    Newspaper: {article['newspaper']}")
    if scores:
        print(f"    Scores: {scores}")
        print(f"    Label confidence: {scores[1]:.3%} ({plc_filter})")
    else:
        print("    Scores: N/A")
    print(f"    Summary: {article['translated_summary'][:150]}...")
    print()

print("(Scores represent: [NOT_POLICY, INFORMING, DECIDING, IMPLEMENTING])")
🔍 Exploring INFORMING Category Articles:
================================================================================

Total INFORMING articles: 8,961

Sample INFORMING articles:

[1] The National Special Equipment Safety Supervision Work Symposium will be held in...
    Newspaper: China Quality Daily
    Scores: [0.0008, 0.9988, 0.0002, 0.0002]
    Label confidence: 99.880% (INFORMING)
    Summary: The National Symposium on the Safety Supervision of Special Equipment in 2025 summarized the work of 2024, emphasizing the need to strictly adhere to ...

[2] The State Administration for Market Regulation held a meeting on the constructio...
    Newspaper: China Quality Daily
    Scores: [0.4614, 0.5308, 0.0065, 0.0013]
    Label confidence: 53.080% (INFORMING)
    Summary: The Market Supervision Administration held a meeting on party conduct and clean governance construction, summarizing the work of 2024, analyzing the s...

[3] Shandong: Anchor the target and take on the main beam, leading the way and makin...
    Newspaper: China Quality Daily
    Scores: [0.0003, 0.9992, 0.0003, 0.0002]
    Label confidence: 99.920% (INFORMING)
    Summary: The market supervision work conference in Shandong Province summarized the work of 2024 and arranged the key tasks for 2025. It emphasized the need to...

(Scores represent: [NOT_POLICY, INFORMING, DECIDING, IMPLEMENTING])
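A natural next step is to rank articles within a category by the model's confidence in that label. The sketch below reuses parse_plc_scores and plc_filtered_articles from the cell above; index 1 corresponds to INFORMING per the score ordering, and the informing_score column name is just an illustrative choice.

# Rank INFORMING articles by the model's confidence in that label
informing = plc_filtered_articles.copy()
informing["informing_score"] = informing["plc_label_scores"].apply(
    lambda s: (parse_plc_scores(s) or [0, 0, 0, 0])[1]
)
top_informing = informing.sort_values("informing_score", ascending=False)
print(top_informing[["title_en", "newspaper", "informing_score"]].head(5))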
Summary
This notebook covered:
✅ Loading and inspecting the dataset structure
✅ Understanding temporal coverage and article distribution
✅ Analyzing article sources (newspapers and news lines)
✅ Working with the JSON entity data structure
✅ Computing entity statistics and distributions
✅ Exploring sample entities in context
✅ Understanding policy label classifications (PLC)
You now have the foundation to perform custom analyses on this entity extraction dataset!
For questions or support, please contact our support team.