Basic Workflow Guide

Standard workflow for collecting and aggregating papers.

Workflow Steps

  1. Collection - Query APIs and download metadata

  2. Aggregation - Deduplicate and filter

  3. Export - Push to Zotero (optional)

Step 1: Collection

Run Collection

uv run python src/run_collecte.py

Results saved to output/collect_YYYYMMDD_HHMMSS/

Output structure:

output/collect_20241113_143022/
├── config_used.yml
├── SemanticScholar/
│   ├── 0/              # Query 0: keyword[0] + year[0]
│   │   ├── page_1
│   │   └── page_2
├── OpenAlex/

Idempotent Behavior

Re-running collection skips already completed queries. Safe to re-run without wasting API quotas.

Step 2: Aggregation

Basic Aggregation

uv run python src/aggregate_collect.py

Process:

  1. Loads JSON files

  2. Converts to unified format

  3. Deduplicates (DOI, URL, fuzzy title)

  4. Applies keyword filtering

  5. Scores quality

  6. Saves to CSV

With Citations

Enable in src/scilex.config.yml:

aggregate_get_citations: true

Then run aggregation. Citations are fetched from cache → Semantic Scholar → OpenCitations.

Output

CSV saved to output/collect_*/aggregated_data.csv

Columns:

  • title, authors, year, DOI, abstract

  • itemType - Publication type

  • publicationTitle - Journal/conference

  • citation_count - Citations (if enabled)

  • quality_score - Metadata completeness (0-100)

  • relevance_score - Relevance (0-10)

Step 3: Export to Zotero

Configure

Edit scilex/api.config.yml:

zotero:
  api_key: "your-key"
  user_mode: "user"  # or "group"

Run Export

uv run python src/push_to_Zotero_collect.py

Papers uploaded in batches. Duplicates skipped by URL.

Filtering Pipeline

Aggregation applies filters:

  1. ItemType - Keep allowed publication types

  2. Keywords - Match search terms

  3. Deduplication - Remove duplicates

  4. Quality - Remove low-quality metadata

  5. Citations - Time-aware thresholds

  6. Relevance - Score and limit to top N

Check logs to see papers filtered at each step.

Complete Example

# src/scilex.config.yml
keywords:
  - ["knowledge graph"]
  - ["LLM", "large language model"]

years: [2023, 2024]

apis:
  - SemanticScholar
  - OpenAlex

aggregate_get_citations: true

quality_filters:
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper
  apply_relevance_ranking: true
  max_papers: 300

Run:

uv run python src/run_collecte.py
uv run python src/aggregate_collect.py
uv run python src/push_to_Zotero_collect.py

Analyze Results

import pandas as pd

df = pd.read_csv('output/collect_*/aggregated_data.csv', delimiter=';')

print(f"Total papers: {len(df)}")
print(f"\nPapers by year:")
print(df['year'].value_counts().sort_index())
print(f"\nTop cited:")
print(df.nlargest(10, 'nb_citation')[['title', 'nb_citation']])

Log Levels

# Default (clean output)
uv run python src/run_collecte.py

# Detailed progress
LOG_LEVEL=INFO uv run python src/run_collecte.py

# Full debugging
LOG_LEVEL=DEBUG uv run python src/run_collecte.py

Next Steps