Basic Workflow Guide
Standard workflow for collecting and aggregating papers.
Workflow Steps
Collection - Query APIs and download metadata
Aggregation - Deduplicate and filter
Export - Push to Zotero (optional)
Step 1: Collection
Configure Search
Edit src/scilex.config.yml:
keywords:
- ["machine learning"]
- []
years: [2023, 2024]
apis:
- SemanticScholar
- OpenAlex
fields: ["title", "abstract"]
Run Collection
uv run python src/run_collecte.py
Results saved to output/collect_YYYYMMDD_HHMMSS/
Output structure:
output/collect_20241113_143022/
├── config_used.yml
├── SemanticScholar/
│ ├── 0/ # Query 0: keyword[0] + year[0]
│ │ ├── page_1
│ │ └── page_2
├── OpenAlex/
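To check what a run actually collected, the directory layout above can be walked with a few lines of Python. This is a minimal sketch that assumes only the structure shown (one folder per API, one numbered folder per query, files named `page_N`); the helper name `count_pages` is hypothetical and not part of the project.

```python
from pathlib import Path


def count_pages(collect_dir):
    """Return {(api, query_id): number_of_page_files} for one collection run.

    Assumes the layout <collect_dir>/<API>/<query_id>/page_N shown above.
    """
    counts = {}
    for api_dir in sorted(Path(collect_dir).iterdir()):
        if not api_dir.is_dir():
            continue  # skip plain files such as config_used.yml
        for query_dir in sorted(api_dir.iterdir()):
            if query_dir.is_dir():
                pages = [p for p in query_dir.iterdir() if p.name.startswith("page_")]
                counts[(api_dir.name, query_dir.name)] = len(pages)
    return counts
```

Calling `count_pages("output/collect_20241113_143022")` returns one entry per query folder, which makes empty or missing queries easy to spot.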
Idempotent Behavior
Re-running collection skips already completed queries. Safe to re-run without wasting API quotas.
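One way such a skip check could work is sketched below. This is not the project's actual implementation: it assumes a query counts as completed once its folder contains at least one `page_*` file, and the function name `pending_queries` is hypothetical.

```python
from pathlib import Path


def pending_queries(collect_dir, api, query_ids):
    """Return the query ids that still have no saved pages for one API.

    Assumption: a query folder with at least one page_* file is complete.
    """
    pending = []
    for qid in query_ids:
        query_dir = Path(collect_dir) / api / str(qid)
        has_pages = query_dir.is_dir() and any(query_dir.glob("page_*"))
        if not has_pages:
            pending.append(qid)
    return pending
```

Only the ids returned here would need new API calls on a re-run, which is how repeated runs avoid spending quota on finished queries.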
Step 2: Aggregation
Basic Aggregation
uv run python src/aggregate_collect.py
Process:
Loads JSON files
Converts to unified format
Deduplicates (DOI, URL, fuzzy title)
Applies keyword filtering
Scores quality
Saves to CSV
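The deduplication step (DOI, URL, fuzzy title) can be illustrated with the standard library. The sketch below is an assumption about how such matching might be done, not the project's code: the record keys `DOI`, `url`, and `title`, the 0.95 similarity threshold, and the use of `difflib.SequenceMatcher` are all illustrative choices.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return " ".join(cleaned.split())


def deduplicate(papers, title_threshold=0.95):
    """Keep the first record of each paper, matching on DOI, URL, then fuzzy title."""
    seen_dois, seen_urls, kept_titles, kept = set(), set(), [], []
    for paper in papers:
        doi = (paper.get("DOI") or "").strip().lower()
        url = (paper.get("url") or "").strip().lower()
        title = normalize_title(paper.get("title") or "")
        if doi and doi in seen_dois:
            continue
        if url and url in seen_urls:
            continue
        if title and any(
            SequenceMatcher(None, title, other).ratio() >= title_threshold
            for other in kept_titles
        ):
            continue
        if doi:
            seen_dois.add(doi)
        if url:
            seen_urls.add(url)
        if title:
            kept_titles.append(title)
        kept.append(paper)
    return kept
```

The fuzzy comparison is quadratic in the number of kept titles, so a real pipeline over thousands of papers would likely need blocking or a faster matcher.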
With Citations
Enable in src/scilex.config.yml:
aggregate_get_citations: true
Then run aggregation. Citation counts are looked up in order: the local cache first, then Semantic Scholar, then OpenCitations.
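The lookup order (cache, then Semantic Scholar, then OpenCitations) is a plain fallback chain. The sketch below shows the pattern with stand-in fetchers; `fetch_citation_count` and the `(name, callable)` source list are hypothetical, and no real API client is called.

```python
def fetch_citation_count(doi, cache, sources):
    """Return (count, source_name), trying the cache and then each source in order.

    `sources` is an ordered list of (name, fetch) pairs, where fetch(doi)
    returns an int or None. These callables are placeholders for real clients.
    """
    if doi in cache:
        return cache[doi], "cache"
    for name, fetch in sources:
        count = fetch(doi)
        if count is not None:
            cache[doi] = count  # remember the answer for later runs
            return count, name
    return None, None
```

Because successful lookups are written back to the cache, a second aggregation over the same papers would not need to contact either remote source.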
Output
CSV saved to output/collect_*/aggregated_data.csv
Columns:
title, authors, year, DOI, abstract
itemType - Publication type
publicationTitle - Journal/conference
citation_count - Citations (if enabled)
quality_score - Metadata completeness (0-100)
relevance_score - Relevance (0-10)
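As an illustration of what a metadata-completeness score on a 0-100 scale could look like, the sketch below counts how many core fields are filled in. The field list and equal weighting are assumptions made for this example; the project's actual scoring formula is not described here and may differ.

```python
# Hypothetical field list; the real score may weight fields differently.
CORE_FIELDS = ("title", "authors", "year", "DOI", "abstract")


def quality_score(paper, fields=CORE_FIELDS):
    """Return the percentage (0-100) of `fields` that are non-empty in `paper`."""
    filled = sum(1 for f in fields if str(paper.get(f) or "").strip())
    return round(100 * filled / len(fields))
```

A record with only a title, year, and DOI would score 60 under this scheme, while a record with all five fields scores 100.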
Step 3: Export to Zotero
Configure
Edit scilex/api.config.yml:
zotero:
api_key: "your-key"
user_mode: "user" # or "group"
Run Export
uv run python src/push_to_Zotero_collect.py
Papers are uploaded in batches. A paper whose URL already exists in the target library is treated as a duplicate and skipped.
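The batching and URL-based skipping can be pictured as follows. This is a sketch under stated assumptions, not the export script itself: `plan_upload`, the `url` key, and the default batch size of 50 are illustrative, and no Zotero request is made.

```python
def plan_upload(papers, existing_urls, batch_size=50):
    """Drop papers whose URL is already known, then split the rest into batches.

    `existing_urls` stands in for the URLs already present in the library.
    Papers without a URL cannot be matched and are always kept.
    """
    seen = set(existing_urls)
    new_papers = []
    for paper in papers:
        url = paper.get("url")
        if url and url in seen:
            continue  # duplicate of a library item or of an earlier paper
        if url:
            seen.add(url)
        new_papers.append(paper)
    return [new_papers[i:i + batch_size] for i in range(0, len(new_papers), batch_size)]
```

Each returned batch would correspond to one upload request, so a failed request only affects the papers in that batch.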
Filtering Pipeline
Aggregation applies filters:
ItemType - Keep allowed publication types
Keywords - Match search terms
Deduplication - Remove duplicates
Quality - Remove low-quality metadata
Citations - Time-aware thresholds
Relevance - Score and limit to top N
Check the logs to see how many papers each step filtered out.
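A staged filter with per-step counts can be sketched as a list of functions applied in order. Everything here is illustrative: the function names, the record keys, and in particular the time-aware rule (older papers need more citations, with the first year after publication free) are assumptions, not the project's thresholds.

```python
def filter_item_types(papers, allowed=("journalArticle", "conferencePaper")):
    """Keep only papers whose itemType is in the allowed list."""
    return [p for p in papers if p.get("itemType") in allowed]


def filter_citations(papers, current_year):
    """Hypothetical time-aware rule: require more citations from older papers."""
    kept = []
    for p in papers:
        required = max(0, current_year - p["year"] - 1)
        if p.get("citation_count", 0) >= required:
            kept.append(p)
    return kept


def top_relevance(papers, max_papers):
    """Sort by relevance_score (highest first) and keep the top max_papers."""
    ranked = sorted(papers, key=lambda p: p.get("relevance_score", 0), reverse=True)
    return ranked[:max_papers]


def run_pipeline(papers, steps):
    """Apply (name, step) pairs in order; return the survivors and per-step counts."""
    counts = []
    for name, step in steps:
        before = len(papers)
        papers = step(papers)
        counts.append((name, before, len(papers)))
    return papers, counts
```

Each entry in `counts` is `(step name, papers in, papers out)`, the same kind of information the aggregation logs report for each filter.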
Complete Example
# src/scilex.config.yml
keywords:
- ["knowledge graph"]
- ["LLM", "large language model"]
years: [2023, 2024]
apis:
- SemanticScholar
- OpenAlex
aggregate_get_citations: true
quality_filters:
enable_itemtype_filter: true
allowed_item_types:
- journalArticle
- conferencePaper
apply_relevance_ranking: true
max_papers: 300
Run:
uv run python src/run_collecte.py
uv run python src/aggregate_collect.py
uv run python src/push_to_Zotero_collect.py
Analyze Results
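import glob
import pandas as pd

# pd.read_csv does not expand wildcards, so pick the most recent run explicitly.
path = sorted(glob.glob('output/collect_*/aggregated_data.csv'))[-1]
df = pd.read_csv(path, delimiter=';')
print(f"Total papers: {len(df)}")
print("\nPapers by year:")
print(df['year'].value_counts().sort_index())
print("\nTop cited:")
# Citation column as named in this example; adjust if your CSV uses citation_count.
print(df.nlargest(10, 'nb_citation')[['title', 'nb_citation']])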
Log Levels
# Default (clean output)
uv run python src/run_collecte.py
# Detailed progress
LOG_LEVEL=INFO uv run python src/run_collecte.py
# Full debugging
LOG_LEVEL=DEBUG uv run python src/run_collecte.py
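For readers wiring the same convention into their own scripts, an environment-driven log level can be set up with the standard `logging` module. This is a minimal sketch, not the project's code: the function name `configure_logging` is hypothetical, and treating WARNING as the quiet default is an assumption.

```python
import logging
import os


def configure_logging(env=os.environ):
    """Set the root log level from LOG_LEVEL, falling back to WARNING."""
    level_name = env.get("LOG_LEVEL", "WARNING").upper()
    level = getattr(logging, level_name, logging.WARNING)
    if not isinstance(level, int):
        level = logging.WARNING  # name matched a non-level attribute
    logging.basicConfig(level=level, format="%(levelname)s %(name)s: %(message)s")
    return level
```

Unknown values fall back to the default instead of raising, so a typo in `LOG_LEVEL` does not stop a long collection run.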
Next Steps
Advanced Filtering - Filtering options
Configuration - All config parameters
API Comparison - API details