Architecture Overview

SciLEx architecture and core components.

System Overview

User Config (YAML)
    ↓
Collection System → APIs → JSON Storage
    ↓
Aggregation Pipeline → Filtering
    ↓
CSV Output / Zotero Export

Core Components

1. Collection System

Location: src/crawlers/collector_collection.py

Orchestrates parallel API collection:

  • Creates jobs from config (keywords × years × APIs)

  • Runs collectors in parallel (multiprocessing)

  • Tracks progress and handles errors

  • Skips completed queries (idempotent)

API Collectors (src/crawlers/collectors.py):

  • Base class: API_collector

  • 11 active implementations: SemanticScholar, OpenAlex, IEEE, Elsevier, Springer, arXiv, HAL, DBLP, ISTEX, OpenAIRE, ORKG

  • 1 deprecated: GoogleScholar

  • Each handles query building, pagination, and response parsing

2. Aggregation Pipeline

Location: src/aggregate_collect.py

Processes collected papers:

  1. Load JSON files from all APIs

  2. Convert to unified format

  3. Deduplicate (DOI, URL, fuzzy title)

  4. Apply keyword filtering

  5. Score quality

  6. Filter by citations

  7. Rank by relevance

  8. Output to CSV

Parallel Mode (src/crawlers/aggregate_parallel.py):

  • Multiprocessing for speed

  • Batch processing (5000 papers/batch)

  • Auto-detects CPU count

3. Format Converters

Location: src/crawlers/aggregate.py

Convert API-specific formats to unified schema:

  • One converter function per API

  • Maps to Zotero-compatible format

  • Uses MISSING_VALUE sentinel for missing fields (never None or "")

Converters registered in FORMAT_CONVERTERS dict:

  • SemanticScholartoZoteroFormat

  • IstextoZoteroFormat

  • ArxivtoZoteroFormat

  • DBLPtoZoteroFormat

  • HALtoZoteroFormat

  • OpenAlextoZoteroFormat

  • IEEEtoZoteroFormat

  • SpringertoZoteroFormat

  • ElseviertoZoteroFormat

  • OpenAIREtoZoteroFormat

  • ORKGtoZoteroFormat

4. Filtering Engine

Location: src/aggregate_collect.py

5-phase filtering:

  1. ItemType filter

  2. Keyword filter

  3. Quality filter

  4. Citation filter

  5. Relevance ranking

5. Citation System

Location: src/citations/citations_tools.py

Three-tier strategy:

  1. SQLite cache (instant)

  2. Semantic Scholar data (if available)

  3. OpenCitations API (rate-limited)

Cache location: output/citation_cache.db

6. Zotero Integration

Location: src/Zotero/push_to_Zotero.py

API client for Zotero:

  • Bulk uploads (50 items/batch)

  • Duplicate detection by URL

  • Collection management

Data Flow

Collection

Config → Job Generation → Parallel Workers → API Calls → JSON Files

Each job:

  • API name

  • Keyword combination

  • Year

  • Output path

Output: output/collect_YYYYMMDD_HHMMSS/{API}/{query_id}/page_*

Aggregation

JSON Files → Format Conversion → Deduplication → Filtering → CSV

Output: aggregated_data.csv with columns:

  • Core: title, authors, year, DOI, abstract

  • Publication: itemType, publicationTitle, volume, issue

  • Metadata: nb_citation, quality_score, relevance_score, archive

Design Patterns

Factory Pattern

API collectors created dynamically:

api_collectors = {
    'SemanticScholar': SemanticScholar_collector,
    'OpenAlex': OpenAlex_collector,
    'OpenAIRE': OpenAIRE_collector,
    'ORKG': ORKG_collector,
    ...
}
collector = api_collectors[api_name](config)

Circuit Breaker

Fails fast for broken APIs:

  • Tracks consecutive failures

  • Opens circuit after 5 failures

  • Skips requests when open

Repository Pattern

Abstracts data storage:

  • JSON for raw collection data

  • CSV for aggregated results

  • SQLite for citation cache

Performance Features

  • Parallel Collection: Multiple APIs simultaneously

  • Parallel Aggregation: Batch processing with multiprocessing

  • Citation Caching: SQLite cache avoids redundant API calls

  • Circuit Breaker: Skip broken APIs quickly

  • Rate Limiting: Per-API throttling

  • Bulk Operations: Zotero uploads in batches

Configuration System

Hierarchical priority:

  1. Default values (in code)

  2. Config files (YAML)

  3. Environment variables

  4. Command-line arguments

Error Handling

  • Specific exception types (no bare except)

  • 30-second timeouts on all API calls

  • Retry logic with exponential backoff

  • State files for recovery

Directory Structure

src/
├── crawlers/
│   ├── collectors.py          # All API collector classes (monolithic)
│   ├── collector_collection.py  # Orchestration and job management
│   ├── aggregate.py           # Format converters (one per API)
│   └── aggregate_parallel.py  # Parallel aggregation
├── citations/
│   └── citations_tools.py
├── Zotero/
│   └── push_to_Zotero.py
├── API tests/                 # Manual API test scripts
├── run_collecte.py            # Main collection entry point
├── aggregate_collect.py       # Main aggregation entry point
├── push_to_Zotero_collect.py  # Zotero export entry point
├── scilex.config.yml          # Search configuration
└── api.config.yml.example     # API credentials template

scilex/                        # Package stubs (in development)
├── api.config.yml             # Active API credentials (not committed)
└── ...

output/
└── collect_*/                 # Timestamped collections
    ├── {API}/                 # Per-API results
    └── aggregated_data.csv    # Final output

Adding New Components

New API Collector

  1. Create collector class in src/crawlers/collectors.py

  2. Implement abstract methods

  3. Add format converter in src/crawlers/aggregate.py

  4. Register in api_collectors dict in src/crawlers/collector_collection.py

  5. Add to config examples

See Adding Collectors for details.

New Filter

  1. Add filter function in src/aggregate_collect.py

  2. Add config options

  3. Insert in filtering pipeline

  4. Update documentation

Testing

Tests in tests/:

  • test_dual_keyword_logic.py - Keyword matching

  • test_openaire_collector.py / test_openaire_aggregation.py - OpenAIRE

  • test_orkg_collector.py / test_orkg_aggregation.py - ORKG

  • Unit tests for collectors and format converters

Run: uv run python -m pytest tests/