Architecture Overview
SciLEx architecture and core components.
System Overview
User Config (YAML)
↓
Collection System → APIs → JSON Storage
↓
Aggregation Pipeline → Filtering
↓
CSV Output / Zotero Export
Core Components
1. Collection System
Location: src/crawlers/collector_collection.py
Orchestrates parallel API collection:
Creates jobs from config (keywords × years × APIs)
Runs collectors in parallel (multiprocessing)
Tracks progress and handles errors
Skips completed queries (idempotent)
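The job expansion above amounts to a Cartesian product with already-finished jobs removed. The following is a minimal sketch, not the code in collector_collection.py: the `Job` dataclass, `build_jobs`, and the config keys `keywords`, `years`, and `apis` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Job:
    """One unit of collection work (hypothetical structure)."""
    api: str
    keyword: str
    year: int


def build_jobs(config: dict, completed=frozenset()) -> list:
    """Expand keywords x years x APIs, skipping jobs that already completed."""
    jobs = [
        Job(api=api, keyword=keyword, year=year)
        for keyword, year, api in product(
            config["keywords"], config["years"], config["apis"]
        )
    ]
    # Idempotence: a rerun only schedules what is still missing.
    return [job for job in jobs if job not in completed]


config = {
    "keywords": ["knowledge graph", "LLM"],
    "years": [2023, 2024],
    "apis": ["OpenAlex", "arXiv"],
}
jobs = build_jobs(config)
```

Because `Job` is frozen and therefore hashable, completed jobs can be kept in a set and checked in constant time.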
API Collectors (src/crawlers/collectors.py):
Base class: API_collector
11 active implementations: SemanticScholar, OpenAlex, IEEE, Elsevier, Springer, arXiv, HAL, DBLP, ISTEX, OpenAIRE, ORKG
1 deprecated: GoogleScholar
Each handles query building, pagination, and response parsing
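The three responsibilities named above (query building, pagination, response parsing) could be split as in the sketch below. This page does not show the interface of API_collector, so the class and method names here are assumptions, and `FakeCollector` is an in-memory stand-in that performs no network calls.

```python
from abc import ABC, abstractmethod


class CollectorSketch(ABC):
    """Illustrative stand-in for a collector base class (not API_collector itself)."""

    page_size = 2

    @abstractmethod
    def build_query(self, keyword: str, year: int, page: int) -> dict:
        """Return request parameters for one page."""

    @abstractmethod
    def fetch(self, params: dict) -> dict:
        """Perform the request and return the decoded response."""

    @abstractmethod
    def parse(self, response: dict) -> list:
        """Extract paper records from a response."""

    def collect(self, keyword: str, year: int) -> list:
        """Walk pages until one comes back shorter than page_size."""
        papers, page = [], 0
        while True:
            batch = self.parse(self.fetch(self.build_query(keyword, year, page)))
            papers.extend(batch)
            if len(batch) < self.page_size:
                return papers
            page += 1


class FakeCollector(CollectorSketch):
    """In-memory collector used only to exercise the pagination loop."""

    data = [{"title": f"paper {i}"} for i in range(5)]

    def build_query(self, keyword, year, page):
        return {"q": keyword, "year": year, "offset": page * self.page_size}

    def fetch(self, params):
        start = params["offset"]
        return {"results": self.data[start:start + self.page_size]}

    def parse(self, response):
        return response["results"]


papers = FakeCollector().collect("knowledge graph", 2024)
```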
2. Aggregation Pipeline
Location: src/aggregate_collect.py
Processes collected papers:
Load JSON files from all APIs
Convert to unified format
Deduplicate (DOI, URL, fuzzy title)
Apply keyword filtering
Score quality
Filter by citations
Rank by relevance
Output to CSV
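The deduplication step (DOI, URL, fuzzy title) can be sketched as follows. This is an illustration only: it uses `difflib.SequenceMatcher` from the standard library with an assumed similarity threshold of 0.9, and the pipeline's actual matching algorithm and threshold may differ. For brevity it treats absent or empty fields as missing rather than using the project's sentinel value.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.lower().split())


def deduplicate(papers: list, threshold: float = 0.9) -> list:
    """Keep the first occurrence of each paper, matching on DOI, URL, then title."""
    seen_dois, seen_urls, kept_titles, unique = set(), set(), [], []
    for paper in papers:
        doi = (paper.get("DOI") or "").lower()
        url = paper.get("url") or ""
        title = normalize(paper.get("title") or "")
        if doi and doi in seen_dois:
            continue
        if url and url in seen_urls:
            continue
        # Pairwise title comparison is quadratic: fine for a sketch, slow at scale.
        if title and any(
            SequenceMatcher(None, title, other).ratio() >= threshold
            for other in kept_titles
        ):
            continue
        if doi:
            seen_dois.add(doi)
        if url:
            seen_urls.add(url)
        if title:
            kept_titles.append(title)
        unique.append(paper)
    return unique


papers = [
    {"DOI": "10.1/A", "title": "Graph Neural Networks for Citation Analysis"},
    {"DOI": "10.1/a", "title": "Something else entirely"},
    {"url": "http://x/1", "title": "Graph neural networks for citation analysis."},
    {"url": "http://x/2", "title": "A Survey of Retrieval Methods"},
]
unique = deduplicate(papers)
```

Here the second record is dropped on DOI (compared case-insensitively) and the third on title similarity.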
Parallel Mode (src/crawlers/aggregate_parallel.py):
Multiprocessing for speed
Batch processing (5000 papers/batch)
Auto-detects CPU count
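A rough picture of the parallel mode, under stated assumptions: the sketch uses `concurrent.futures.ProcessPoolExecutor` (a standard-library wrapper around multiprocessing) rather than whatever aggregate_parallel.py actually calls, and `process_batch` is a hypothetical placeholder for the real per-batch work.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Optional

BATCH_SIZE = 5000  # batch size stated above


def make_batches(items: list, size: int = BATCH_SIZE) -> list:
    """Split items into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_batch(batch: list) -> list:
    """Placeholder for per-batch work such as format conversion."""
    return [{**paper, "processed": True} for paper in batch]


def aggregate_parallel(papers: list, workers: Optional[int] = None) -> list:
    """Process batches in worker processes; defaults to one worker per CPU."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves batch order, so output order matches input order.
        results = pool.map(process_batch, make_batches(papers))
        return [paper for batch in results for paper in batch]
```

On platforms that start workers with spawn (Windows, macOS), call `aggregate_parallel` from inside an `if __name__ == "__main__":` guard.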
3. Format Converters
Location: src/crawlers/aggregate.py
Convert API-specific formats to unified schema:
One converter function per API
Maps to Zotero-compatible format
Uses MISSING_VALUE sentinel for missing fields (never None or "")
Converters registered in FORMAT_CONVERTERS dict:
SemanticScholartoZoteroFormat
IstextoZoteroFormat
ArxivtoZoteroFormat
DBLPtoZoteroFormat
HALtoZoteroFormat
OpenAlextoZoteroFormat
IEEEtoZoteroFormat
SpringertoZoteroFormat
ElseviertoZoteroFormat
OpenAIREtoZoteroFormat
ORKGtoZoteroFormat
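To make the converter contract concrete, here is a minimal sketch of one converter and a registry. Everything in it is illustrative: the `"NA"` value given to `MISSING_VALUE` is an assumed placeholder (the project's actual sentinel value is not stated here), and `example_to_zotero_format`, `CONVERTERS_SKETCH`, and the input field names describe a hypothetical API.

```python
MISSING_VALUE = "NA"  # assumed placeholder; the project's real sentinel may differ


def example_to_zotero_format(record: dict) -> dict:
    """Map a hypothetical API record onto a few unified fields."""

    def field(key: str) -> str:
        value = record.get(key)
        # Never propagate None or "": substitute the sentinel instead.
        return MISSING_VALUE if value in (None, "") else str(value)

    return {
        "title": field("title"),
        "DOI": field("doi"),
        "year": field("year"),
        "abstract": field("abstract"),
    }


# Registry keyed by API name, mirroring the FORMAT_CONVERTERS idea.
CONVERTERS_SKETCH = {"ExampleAPI": example_to_zotero_format}

row = CONVERTERS_SKETCH["ExampleAPI"]({"title": "A paper", "doi": "", "year": 2024})
```

Routing every field through one helper keeps the "never None or empty string" rule in a single place instead of repeating it per field.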
4. Filtering Engine
Location: src/aggregate_collect.py
5-phase filtering:
ItemType filter
Keyword filter
Quality filter
Citation filter
Relevance ranking
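The five phases can be read as functions applied in sequence, each taking and returning a list of papers. The sketch below assumes that shape. The field names come from the CSV columns documented on this page, while the function names, the allowed item types, and the thresholds are hypothetical.

```python
from functools import partial

ALLOWED_TYPES = {"journalArticle", "conferencePaper"}  # assumed whitelist


def item_type_filter(papers):
    return [p for p in papers if p["itemType"] in ALLOWED_TYPES]


def keyword_filter(papers, keywords):
    def matches(paper):
        text = f"{paper['title']} {paper['abstract']}".lower()
        return any(keyword.lower() in text for keyword in keywords)

    return [p for p in papers if matches(p)]


def quality_filter(papers, min_score=0.5):
    return [p for p in papers if p["quality_score"] >= min_score]


def citation_filter(papers, min_citations=1):
    return [p for p in papers if p["nb_citation"] >= min_citations]


def relevance_ranking(papers):
    return sorted(papers, key=lambda p: p["relevance_score"], reverse=True)


def run_pipeline(papers, phases):
    """Apply each phase in order, feeding its output to the next one."""
    for phase in phases:
        papers = phase(papers)
    return papers


PHASES = [
    item_type_filter,
    partial(keyword_filter, keywords=["knowledge graph"]),
    quality_filter,
    citation_filter,
    relevance_ranking,
]
```

With this shape, adding a filter means writing one more function and inserting it at the right position in the phase list.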
5. Citation System
Location: src/citations/citations_tools.py
Three-tier strategy:
SQLite cache (instant)
Semantic Scholar data (if available)
OpenCitations API (rate-limited)
Cache location: output/citation_cache.db
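The three-tier lookup can be sketched with the standard-library `sqlite3` module. The table layout, the function names, and the `fetch_remote` callable (which stands in for the rate-limited OpenCitations request) are assumptions for illustration, not the schema or API of citations_tools.py.

```python
import sqlite3
from typing import Callable, Optional


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache; pass a file path to persist between runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations (doi TEXT PRIMARY KEY, nb_citation INTEGER)"
    )
    return conn


def get_citation_count(
    conn: sqlite3.Connection,
    doi: str,
    semantic_scholar_count: Optional[int],
    fetch_remote: Callable[[str], Optional[int]],
) -> Optional[int]:
    # Tier 1: local cache, no network involved.
    row = conn.execute(
        "SELECT nb_citation FROM citations WHERE doi = ?", (doi,)
    ).fetchone()
    if row is not None:
        return row[0]
    # Tier 2: a count already present in collected Semantic Scholar data.
    count = semantic_scholar_count
    # Tier 3: fall back to the remote lookup only when nothing else is known.
    if count is None:
        count = fetch_remote(doi)
    if count is not None:
        conn.execute(
            "INSERT OR REPLACE INTO citations (doi, nb_citation) VALUES (?, ?)",
            (doi, count),
        )
        conn.commit()
    return count
```

Ordering the tiers from cheapest to most expensive means the rate-limited service is contacted at most once per DOI.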
6. Zotero Integration
Location: src/Zotero/push_to_Zotero.py
API client for Zotero:
Bulk uploads (50 items/batch)
Duplicate detection by URL
Collection management
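Batching and URL-based duplicate detection can be illustrated without touching the Zotero API. In the sketch below, `new_items` and `batches` are hypothetical helpers, and the upload call itself is deliberately left out.

```python
ZOTERO_BATCH_SIZE = 50  # batch size stated above


def new_items(items: list, existing_urls: set) -> list:
    """Drop items whose URL is already in the library or repeated in the input."""
    seen = set(existing_urls)
    fresh = []
    for item in items:
        url = item.get("url")
        if url and url in seen:
            continue
        if url:
            seen.add(url)
        # Items without a URL cannot be compared, so they are kept.
        fresh.append(item)
    return fresh


def batches(items: list, size: int = ZOTERO_BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


items = [{"url": f"https://example.org/{i}"} for i in range(120)]
already_uploaded = {"https://example.org/0", "https://example.org/1"}
to_upload = list(batches(new_items(items, already_uploaded)))
```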
Data Flow
Collection
Config → Job Generation → Parallel Workers → API Calls → JSON Files
Each job:
API name
Keyword combination
Year
Output path
Output: output/collect_YYYYMMDD_HHMMSS/{API}/{query_id}/page_*
Aggregation
JSON Files → Format Conversion → Deduplication → Filtering → CSV
Output: aggregated_data.csv with columns:
Core: title, authors, year, DOI, abstract
Publication: itemType, publicationTitle, volume, issue
Metadata: nb_citation, quality_score, relevance_score, archive
Design Patterns
Factory Pattern
API collectors created dynamically:
api_collectors = {
    'SemanticScholar': SemanticScholar_collector,
    'OpenAlex': OpenAlex_collector,
    'OpenAIRE': OpenAIRE_collector,
    'ORKG': ORKG_collector,
    # ...
}

# Look up the collector class by API name and instantiate it
collector = api_collectors[api_name](config)
Circuit Breaker
Fails fast for broken APIs:
Tracks consecutive failures
Opens circuit after 5 failures
Skips requests when open
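A minimal circuit breaker matching the three points above might look like this. The class and exception names are hypothetical. The sketch has no reset timer or half-open state because none is described here, so once open it stays open for the lifetime of the object.

```python
class CircuitOpenError(RuntimeError):
    """Raised instead of calling an API whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, func, *args, **kwargs):
        if self.is_open:
            # Fail fast: do not spend a timeout on an API known to be broken.
            raise CircuitOpenError("circuit open, request skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise
        # Any success resets the streak, so only consecutive failures count.
        self.consecutive_failures = 0
        return result
```

One breaker per API keeps a single failing service from affecting the others.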
Repository Pattern
Abstracts data storage:
JSON for raw collection data
CSV for aggregated results
SQLite for citation cache
Performance Features
Parallel Collection: Multiple APIs simultaneously
Parallel Aggregation: Batch processing with multiprocessing
Citation Caching: SQLite cache avoids redundant API calls
Circuit Breaker: Skip broken APIs quickly
Rate Limiting: Per-API throttling
Bulk Operations: Zotero uploads in batches
Configuration System
Hierarchical priority:
Default values (in code)
Config files (YAML)
Environment variables
Command-line arguments
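One way to picture the layering is with `collections.ChainMap`. The sketch assumes the list above runs from lowest to highest priority, so command-line arguments override environment variables, which override config files, which override defaults. The setting names and values are hypothetical.

```python
from collections import ChainMap

DEFAULTS = {"output_dir": "output", "workers": 4, "timeout": 30}


def resolve_config(file_cfg: dict, env_cfg: dict, cli_cfg: dict) -> ChainMap:
    """Layer the sources so that the first mapping containing a key wins."""
    # ChainMap searches left to right, so the highest-priority source comes first.
    return ChainMap(cli_cfg, env_cfg, file_cfg, DEFAULTS)


settings = resolve_config(
    file_cfg={"workers": 8, "output_dir": "results"},
    env_cfg={"workers": 2},
    cli_cfg={"output_dir": "run1"},
)
```

In this example `workers` resolves to the environment value, `output_dir` to the command-line value, and `timeout` falls through to the default.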
Error Handling
Specific exception types (no bare except)
30-second timeouts on all API calls
Retry logic with exponential backoff
State files for recovery
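Retry with exponential backoff can be sketched as a small wrapper. The function name, the default exception tuple, the attempt count, and the base delay are assumptions, and the wrapped callable is expected to apply its own 30-second timeout.

```python
import time


def with_retries(
    func,
    retryable=(TimeoutError, ConnectionError),
    attempts=4,
    base_delay=1.0,
    sleep=time.sleep,
):
    """Call func(), retrying on the listed exceptions with delays of 1s, 2s, 4s, ..."""
    for attempt in range(attempts):
        try:
            return func()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            # Double the wait after every failed attempt.
            sleep(base_delay * 2 ** attempt)
```

Catching only the listed exception types keeps programming errors such as `KeyError` from being retried, and the injectable `sleep` makes the wrapper testable without waiting.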
Directory Structure
src/
├── crawlers/
│   ├── collectors.py             # All API collector classes (monolithic)
│   ├── collector_collection.py   # Orchestration and job management
│   ├── aggregate.py              # Format converters (one per API)
│   └── aggregate_parallel.py     # Parallel aggregation
├── citations/
│   └── citations_tools.py
├── Zotero/
│   └── push_to_Zotero.py
├── API tests/                    # Manual API test scripts
├── run_collecte.py               # Main collection entry point
├── aggregate_collect.py          # Main aggregation entry point
├── push_to_Zotero_collect.py     # Zotero export entry point
├── scilex.config.yml             # Search configuration
└── api.config.yml.example        # API credentials template
scilex/                           # Package stubs (in development)
├── api.config.yml                # Active API credentials (not committed)
└── ...
output/
└── collect_*/                    # Timestamped collections
    ├── {API}/                    # Per-API results
    └── aggregated_data.csv       # Final output
Adding New Components
New API Collector
Create collector class in src/crawlers/collectors.py
Implement abstract methods
Add format converter in src/crawlers/aggregate.py
Register in api_collectors dict in src/crawlers/collector_collection.py
Add to config examples
See Adding Collectors for details.
New Filter
Add filter function in src/aggregate_collect.py
Add config options
Insert in filtering pipeline
Update documentation
Testing
Tests in tests/:
test_dual_keyword_logic.py - Keyword matching
test_openaire_collector.py / test_openaire_aggregation.py - OpenAIRE
test_orkg_collector.py / test_orkg_aggregation.py - ORKG
Unit tests for collectors and format converters
Run: uv run python -m pytest tests/