Architecture Overview

This page describes the SciLEx architecture and its core components.

System Overview

User Config (YAML)
    ↓
Collection System → APIs → JSON Storage
    ↓
Aggregation Pipeline → Filtering
    ↓
CSV Output / Zotero Export

Core Components

1. Collection System

Location: src/crawlers/collector_collection.py

Orchestrates parallel API collection:

Creates jobs from config (keywords × years × APIs)
Runs collectors in parallel (multiprocessing)
Tracks progress and handles errors
Skips queries that already completed, so reruns are idempotent
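As a rough illustration of the job creation and skip logic listed above, the sketch below builds one job per keyword, year, and API combination and drops combinations already marked as done. The `Job` dataclass, `build_jobs`, and the `completed` set are hypothetical names chosen for this example, not SciLEx's actual code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Job:
    """One unit of collection work (hypothetical structure)."""
    api: str
    keyword: str
    year: int


def build_jobs(keywords, years, apis, completed=frozenset()):
    """Return one Job per (keyword, year, API) combination not yet completed."""
    jobs = [Job(api, kw, yr) for kw, yr, api in product(keywords, years, apis)]
    return [job for job in jobs if job not in completed]


# One combination was collected in an earlier run and is skipped.
done = {Job("OpenAlex", "knowledge graph", 2023)}
jobs = build_jobs(["knowledge graph"], [2023, 2024], ["OpenAlex", "HAL"], done)
print(len(jobs))
```

Because the job list is derived purely from the config and the completed set, running the same call again after more jobs finish simply yields a shorter list.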

API Collectors (src/crawlers/collectors.py):

Base class: API_collector
11 active implementations: SemanticScholar, OpenAlex, IEEE, Elsevier, Springer, arXiv, HAL, DBLP, ISTEX, OpenAIRE, ORKG
1 deprecated: GoogleScholar
Each handles query building, pagination, and response parsing
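The division of work between a base class and its implementations might look roughly like the sketch below. It is not the real API_collector interface: `CollectorSketch`, its method names, and the in-memory subclass are all assumptions made for illustration, and no network call is made.

```python
from abc import ABC, abstractmethod


class CollectorSketch(ABC):
    """Hypothetical stand-in for a collector base class."""

    @abstractmethod
    def build_query(self, keyword, year, page):
        """Return the request parameters for one page of results."""

    @abstractmethod
    def fetch(self, query):
        """Send the query and return the raw response."""

    @abstractmethod
    def parse_response(self, response):
        """Return (records, has_more) extracted from a raw response."""

    def collect(self, keyword, year):
        """Walk the pages until the API reports no more results."""
        records, page, has_more = [], 1, True
        while has_more:
            response = self.fetch(self.build_query(keyword, year, page))
            batch, has_more = self.parse_response(response)
            records.extend(batch)
            page += 1
        return records


class InMemoryCollector(CollectorSketch):
    """Toy subclass that serves two fixed pages instead of calling an API."""

    PAGES = {1: ["paper A", "paper B"], 2: ["paper C"]}

    def build_query(self, keyword, year, page):
        return {"q": keyword, "year": year, "page": page}

    def fetch(self, query):
        page = query["page"]
        return {"items": self.PAGES[page], "last": page == max(self.PAGES)}

    def parse_response(self, response):
        return response["items"], not response["last"]


papers = InMemoryCollector().collect("knowledge graph", 2024)
print(papers)
```

Keeping the pagination loop in the base class means each implementation only has to describe how its API phrases a query and shapes a response.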

2. Aggregation Pipeline

Location: src/aggregate_collect.py

Processes collected papers:

Load JSON files from all APIs
Convert to unified format
Deduplicate (DOI, URL, fuzzy title)
Apply keyword filtering
Score quality
Filter by citations
Rank by relevance
Output to CSV
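The deduplication step can be pictured with the minimal sketch below, which checks DOI, then URL, then title similarity. The function names, the record keys, and the 0.9 similarity threshold are assumptions for this example; the matching rules in src/aggregate_collect.py may differ.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title and collapse whitespace for comparison."""
    return " ".join(title.lower().split())


def is_duplicate(paper, kept, threshold=0.9):
    """Check DOI first, then URL, then fuzzy title similarity."""
    for other in kept:
        if paper.get("DOI") and paper.get("DOI") == other.get("DOI"):
            return True
        if paper.get("url") and paper.get("url") == other.get("url"):
            return True
        ratio = SequenceMatcher(
            None, normalize(paper["title"]), normalize(other["title"])
        ).ratio()
        if ratio >= threshold:
            return True
    return False


def deduplicate(papers):
    """Keep the first occurrence of each paper."""
    kept = []
    for paper in papers:
        if not is_duplicate(paper, kept):
            kept.append(paper)
    return kept


papers = [
    {"title": "Graph Neural Networks: A Survey", "DOI": "10.1000/x1"},
    {"title": "Graph neural networks:  a survey", "url": "https://example.org/a"},
    {"title": "A different paper", "DOI": "10.1000/x1"},
    {"title": "Transformers in Vision"},
]
unique = deduplicate(papers)
print([p["title"] for p in unique])
```

This pairwise comparison is quadratic in the number of papers, which is acceptable for a sketch but would likely need indexing by DOI and URL on large collections.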

Parallel Mode (src/crawlers/aggregate_parallel.py):

Multiprocessing for speed
Batch processing (5000 papers/batch)
Auto-detects CPU count
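A minimal sketch of the batching idea follows, assuming the standard library's `concurrent.futures` rather than whatever aggregate_parallel.py actually uses. `make_batches`, `convert_batch`, and `aggregate_in_parallel` are hypothetical names, and the per-batch work is a placeholder.

```python
import os
from concurrent.futures import ProcessPoolExecutor

BATCH_SIZE = 5000  # batch size quoted above


def make_batches(items, size=BATCH_SIZE):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def convert_batch(batch):
    """Placeholder per-batch work; here it only normalizes titles."""
    return [title.strip().lower() for title in batch]


def aggregate_in_parallel(items, workers=None):
    """Process batches in worker processes; defaults to the CPU count."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(convert_batch, make_batches(items))
    # Flatten the per-batch results back into one list, in input order.
    return [row for batch in results for row in batch]


titles = [f" Paper {i} " for i in range(12000)]
batch_sizes = [len(batch) for batch in make_batches(titles)]
print(batch_sizes)
```

`aggregate_in_parallel(titles)` is best called from under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module.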

3. Format Converters

Location: src/crawlers/aggregate.py

Convert API-specific formats to unified schema:

One converter function per API
Maps to Zotero-compatible format
Uses MISSING_VALUE sentinel for missing fields (never None or "")

Converters registered in FORMAT_CONVERTERS dict:

SemanticScholartoZoteroFormat
IstextoZoteroFormat
ArxivtoZoteroFormat
DBLPtoZoteroFormat
HALtoZoteroFormat
OpenAlextoZoteroFormat
IEEEtoZoteroFormat
SpringertoZoteroFormat
ElseviertoZoteroFormat
OpenAIREtoZoteroFormat
ORKGtoZoteroFormat
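To show what the sentinel rule means in practice, here is a sketch of a converter for a made-up API. The value `"NA"` for MISSING_VALUE, the `ExampleAPI` record keys, and the `FORMAT_CONVERTERS_SKETCH` dict are assumptions for illustration; only the idea of replacing None and "" with a single sentinel comes from the description above.

```python
MISSING_VALUE = "NA"  # assumed placeholder; the real sentinel value may differ


def example_to_zotero_format(raw):
    """Map a made-up API record onto a few Zotero-style fields."""

    def field(key):
        # Treat absent keys, None, and empty strings as missing.
        value = raw.get(key)
        return value if value not in (None, "") else MISSING_VALUE

    return {
        "title": field("paper_title"),
        "DOI": field("doi"),
        "abstractNote": field("summary"),
        "date": field("published"),
    }


# Hypothetical registry keyed by API name, mirroring the dict described above.
FORMAT_CONVERTERS_SKETCH = {"ExampleAPI": example_to_zotero_format}

record = FORMAT_CONVERTERS_SKETCH["ExampleAPI"](
    {"paper_title": "Graph Survey", "doi": "", "summary": None}
)
print(record)
```

With a single sentinel, downstream steps can test for missing data with one comparison instead of handling None, empty strings, and absent keys separately.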

4. Filtering Engine

Location: src/aggregate_collect.py

Five-phase filtering, applied in order:

ItemType filter
Keyword filter
Quality filter
Citation filter
Relevance ranking
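One way to picture the five phases is as a list of functions applied in sequence, as in the sketch below. The thresholds, allowed item types, and keyword are invented for the example, and the scoring itself is assumed to have happened already; the field names `itemType`, `nb_citation`, `quality_score`, and `relevance_score` are taken from the CSV columns described later on this page.

```python
def item_type_filter(papers, allowed=("journalArticle", "conferencePaper")):
    return [p for p in papers if p["itemType"] in allowed]


def keyword_filter(papers, keywords=("graph",)):
    return [p for p in papers if any(k in p["title"].lower() for k in keywords)]


def quality_filter(papers, min_score=0.5):
    return [p for p in papers if p["quality_score"] >= min_score]


def citation_filter(papers, min_citations=1):
    return [p for p in papers if p["nb_citation"] >= min_citations]


def relevance_ranking(papers):
    return sorted(papers, key=lambda p: p["relevance_score"], reverse=True)


# Phases run in the order listed above; each receives the previous output.
PHASES = [item_type_filter, keyword_filter, quality_filter,
          citation_filter, relevance_ranking]


def run_pipeline(papers):
    for phase in PHASES:
        papers = phase(papers)
    return papers


papers = [
    {"title": "Graph Learning", "itemType": "journalArticle",
     "quality_score": 0.9, "nb_citation": 12, "relevance_score": 0.4},
    {"title": "Graph Databases", "itemType": "conferencePaper",
     "quality_score": 0.7, "nb_citation": 3, "relevance_score": 0.8},
    {"title": "Graph Notes", "itemType": "blogPost",
     "quality_score": 0.9, "nb_citation": 5, "relevance_score": 0.9},
    {"title": "Image Segmentation", "itemType": "journalArticle",
     "quality_score": 0.9, "nb_citation": 40, "relevance_score": 0.9},
]
ranked = run_pipeline(papers)
print([p["title"] for p in ranked])
```

Keeping the phases in a list makes it straightforward to insert a new filter at a chosen position.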

5. Citation System

Location: src/citations/citations_tools.py

Three-tier lookup strategy, tried in order:

SQLite cache (instant)
Semantic Scholar data (if available)
OpenCitations API (rate-limited)

Cache location: output/citation_cache.db
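The tiered lookup can be sketched as below, using an in-memory SQLite database and a fake function in place of the OpenCitations API. The table schema, the function names, and the choice to write tier 2 and tier 3 results back into the cache are assumptions for this example, not a description of citations_tools.py.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a citation cache with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations "
        "(doi TEXT PRIMARY KEY, n_citations INTEGER)"
    )
    return conn


def get_citation_count(conn, doi, s2_count=None, fetch_remote=None):
    """Try the cache, then Semantic Scholar data, then a remote lookup."""
    row = conn.execute(
        "SELECT n_citations FROM citations WHERE doi = ?", (doi,)
    ).fetchone()
    if row is not None:
        return row[0]                  # tier 1: cache hit
    if s2_count is not None:
        count = s2_count               # tier 2: data already collected
    elif fetch_remote is not None:
        count = fetch_remote(doi)      # tier 3: rate-limited API call
    else:
        return None
    conn.execute(
        "INSERT OR REPLACE INTO citations (doi, n_citations) VALUES (?, ?)",
        (doi, count),
    )
    conn.commit()
    return count


remote_calls = []


def fake_opencitations(doi):
    """Stand-in for the remote API; records each call it receives."""
    remote_calls.append(doi)
    return 7


conn = open_cache()
first = get_citation_count(conn, "10.1000/x1", fetch_remote=fake_opencitations)
second = get_citation_count(conn, "10.1000/x1", fetch_remote=fake_opencitations)
print(first, second, len(remote_calls))
```

The second lookup is answered from the cache, so the remote function is called only once for the same DOI.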

6. Zotero Integration

Location: src/Zotero/push_to_Zotero.py

API client for Zotero:

Bulk uploads (50 items/batch)
Duplicate detection by URL
Collection management
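The sketch below illustrates URL-based duplicate detection followed by splitting into batches of 50, without making any Zotero API calls. `new_items`, `upload_batches`, and the item layout are hypothetical; the actual upload logic lives in push_to_Zotero.py.

```python
BATCH_SIZE = 50  # items per upload batch, as stated above


def new_items(items, existing_urls):
    """Drop items whose URL is already in the library or seen earlier."""
    seen = set(existing_urls)
    fresh = []
    for item in items:
        url = item.get("url")
        if url in seen:
            continue
        if url:
            seen.add(url)
        fresh.append(item)
    return fresh


def upload_batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


items = [{"title": f"Paper {i}", "url": f"https://example.org/{i}"}
         for i in range(120)]
existing = {"https://example.org/0", "https://example.org/1"}

sizes = [len(batch) for batch in upload_batches(new_items(items, existing))]
print(sizes)
```

Filtering before batching keeps every request full except the last, which reduces the number of API calls.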

Data Flow

Collection

Config → Job Generation → Parallel Workers → API Calls → JSON Files

Each job:

API name
Keyword combination
Year
Output path

Output: output/collect_YYYYMMDD_HHMMSS/{API}/{query_id}/page_*

Aggregation

JSON Files → Format Conversion → Deduplication → Filtering → CSV

Output: aggregated_data.csv with columns:

Core: title, authors, year, DOI, abstract
Publication: itemType, publicationTitle, volume, issue
Metadata: nb_citation, quality_score, relevance_score, archive

Design Patterns

Factory Pattern

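API collectors are instantiated dynamically from a name-to-class registry:

# Map each API name to its collector class.
api_collectors = {
    'SemanticScholar': SemanticScholar_collector,
    'OpenAlex': OpenAlex_collector,
    'OpenAIRE': OpenAIRE_collector,
    'ORKG': ORKG_collector,
    # ... remaining collectors
}
# Look up the class by name and instantiate it with the config.
collector = api_collectors[api_name](config)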

Circuit Breaker

Fails fast for broken APIs:

Tracks consecutive failures
Opens circuit after 5 failures
Skips requests when open
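A minimal circuit breaker along these lines is sketched below. The class and exception names are hypothetical, and the sketch omits any timed reset (a "half-open" state), which a production breaker would normally have.

```python
class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures so far

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, func, *args, **kwargs):
        """Run func unless the circuit is open; track consecutive failures."""
        if self.is_open:
            raise CircuitOpenError("too many consecutive failures")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the streak
        return result


def broken_api():
    raise ConnectionError("API unreachable")


breaker = CircuitBreaker()
sent = 0
for _ in range(8):
    try:
        breaker.call(broken_api)
    except ConnectionError:
        sent += 1   # the request was attempted and failed
    except CircuitOpenError:
        pass        # skipped without contacting the API

print(sent, breaker.is_open)
```

Of the eight attempts, only the first five reach the broken API; the remaining three are rejected immediately.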

Repository Pattern

Abstracts data storage:

JSON for raw collection data
CSV for aggregated results
SQLite for citation cache

Performance Features

Parallel Collection: Multiple APIs simultaneously
Parallel Aggregation: Batch processing with multiprocessing
Citation Caching: SQLite cache avoids redundant API calls
Circuit Breaker: Skip broken APIs quickly
Rate Limiting: Per-API throttling
Bulk Operations: Zotero uploads in batches

Configuration System

Hierarchical priority, from lowest to highest:

Default values (in code)
Config files (YAML)
Environment variables
Command-line arguments
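Assuming that sources later in this list override earlier ones, the layering could be expressed with `collections.ChainMap`, which searches its mappings from first to last. The keys and values below are invented for the example and do not reflect SciLEx's real option names.

```python
from collections import ChainMap

# Each layer only contains the settings it explicitly sets (hypothetical keys).
defaults = {"output_dir": "output", "workers": 4, "timeout": 30}
file_config = {"workers": 8}     # e.g. loaded from a YAML file
env_config = {"timeout": 60}     # e.g. read from environment variables
cli_args = {"workers": 2}        # e.g. parsed from the command line

# ChainMap returns the first match, so the highest-priority layer goes first.
config = ChainMap(cli_args, env_config, file_config, defaults)

print(config["workers"], config["timeout"], config["output_dir"])
```

Here `workers` comes from the command line, `timeout` from the environment, and `output_dir` falls through to the default.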

Error Handling

Specific exception types (no bare except)
30-second timeouts on all API calls
Retry logic with exponential backoff
State files for recovery
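Retry with exponential backoff can be sketched as follows. The `retry` helper, its default parameters, and the set of exceptions it retries on are assumptions for illustration; the `sleep` argument is injectable so the example runs without actually waiting.

```python
import time


def retry(func, attempts=4, base_delay=1.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, doubling the wait after each failure; re-raise on the last."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky():
    """Fail twice with a specific exception type, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []  # record the requested waits instead of sleeping
result = retry(flaky, sleep=delays.append)
print(result, delays)
```

Only the listed exception types are retried, in keeping with the rule against bare `except`; anything else propagates immediately.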

Directory Structure

src/
├── crawlers/
│   ├── collectors.py          # All API collector classes (monolithic)
│   ├── collector_collection.py  # Orchestration and job management
│   ├── aggregate.py           # Format converters (one per API)
│   └── aggregate_parallel.py  # Parallel aggregation
├── citations/
│   └── citations_tools.py
├── Zotero/
│   └── push_to_Zotero.py
├── API tests/                 # Manual API test scripts
├── run_collecte.py            # Main collection entry point
├── aggregate_collect.py       # Main aggregation entry point
├── push_to_Zotero_collect.py  # Zotero export entry point
├── scilex.config.yml          # Search configuration
└── api.config.yml.example     # API credentials template

scilex/                        # Package stubs (in development)
├── api.config.yml             # Active API credentials (not committed)
└── ...

output/
└── collect_*/                 # Timestamped collections
    ├── {API}/                 # Per-API results
    └── aggregated_data.csv    # Final output

Adding New Components

New API Collector

Create collector class in src/crawlers/collectors.py
Implement abstract methods
Add format converter in src/crawlers/aggregate.py
Register in api_collectors dict in src/crawlers/collector_collection.py
Add to config examples

See Adding Collectors for details.

New Filter

Add filter function in src/aggregate_collect.py
Add config options
Insert in filtering pipeline
Update documentation

Testing

Tests in tests/:

test_dual_keyword_logic.py - Keyword matching
test_openaire_collector.py / test_openaire_aggregation.py - OpenAIRE
test_orkg_collector.py / test_orkg_aggregation.py - ORKG
Unit tests for collectors and format converters

Run: uv run python -m pytest tests/