# Adding API Collectors Guide

Guide for adding new academic API collectors to SciLEx.

## Overview

Steps to add a collector:

1. Create collector class
2. Implement required methods
3. Create format converter
4. Register collector
5. Add configuration
6. Test

## Collector Class

Create in `src/crawlers/collectors.py`:

```python
class YourAPI_collector(API_collector):
    """Collector for YourAPI."""

    def __init__(self, config=None):
        super().__init__()
        self.api_name = "YourAPI"  # Must match config and registration key
        self.base_url = "https://api.yourapi.com"
        self.max_by_page = 100
        self.api_key = None  # Stays None when no config or key is supplied

        if config:
            self.api_key = config.get('yourapi', {}).get('api_key')

        self.load_rate_limit_from_config(config)  # Always call last

    def get_configurated_url(self):
        """Return URL template with {} placeholder for page/offset."""
        params = f"query={{}}&pageSize={self.max_by_page}"
        if self.api_key:
            params += f"&apiKey={self.api_key}"
        return f"{self.base_url}/search?{params}"

    def get_offset(self, page):
        """Return the value to substitute into the URL template for the given page.

        Examples:
        - 1-based page (OpenAlex, OpenAIRE style): return page
        - 0-based page (ORKG style): return page - 1
        - Offset-based (DBLP, ISTEX style): return (page - 1) * self.max_by_page
        """
        return page  # adjust as needed for your API

    def query_build(self, keywords, year, fields):
        """Build the API query string from keywords and year."""
        # Single group mode (second group missing or empty)
        if len(keywords) < 2 or not keywords[1]:
            query = " OR ".join(keywords[0])
        # Dual group mode
        else:
            g1 = "(" + " OR ".join(keywords[0]) + ")"
            g2 = "(" + " OR ".join(keywords[1]) + ")"
            query = f"{g1} AND {g2}"

        return f"{query} AND year:{year}"

    def parsePageResults(self, response, page):
        """Parse one page of API response.

        Must return a dict with keys:
        date_search, id_collect, page, total, results
        """
        data = response.json()
        total = data.get("total", 0)
        results = data.get("items", [])

        return {
            "date_search": self.date_search,
            "id_collect": self.id_collect,
            "page": page,
            "total": total,
            "results": results,
        }
```

## Format Converter

Add to `src/crawlers/aggregate.py`:

```python
from scilex.constants import MISSING_VALUE


def YourAPItoZoteroFormat(paper):
    """Convert YourAPI format to Zotero-compatible unified format."""
    # Determine item type
    item_type = 'journalArticle'  # Default
    pub_type = (paper.get('type') or '').lower()
    if 'conference' in pub_type:
        item_type = 'conferencePaper'
    elif 'book' in pub_type:
        item_type = 'book'

    # Format authors
    authors = paper.get('authors') or []
    author_str = ', '.join(authors) if authors else MISSING_VALUE

    # `or MISSING_VALUE` also covers keys that are present but null or empty
    return {
        'itemType': item_type,
        'title': paper.get('title') or MISSING_VALUE,
        'authors': author_str,
        'abstractNote': paper.get('abstract') or MISSING_VALUE,
        'date': str(paper.get('year') or MISSING_VALUE),
        'DOI': paper.get('doi') or MISSING_VALUE,
        'url': paper.get('url') or MISSING_VALUE,
        'publicationTitle': paper.get('journal') or MISSING_VALUE,
        'volume': str(paper.get('volume') or MISSING_VALUE),
        'issue': str(paper.get('issue') or MISSING_VALUE),
        'pages': paper.get('pages') or MISSING_VALUE,
        'year': str(paper.get('year') or MISSING_VALUE),
        'citation_count': paper.get('citations', 0),
    }
```

## Registration

### Register the collector

In `src/crawlers/collector_collection.py`:

```python
api_collectors = {
    'SemanticScholar': SemanticScholar_collector,
    'OpenAlex': OpenAlex_collector,
    # Add your collector
    'YourAPI': YourAPI_collector,
}
```

### Register the format converter

In `src/crawlers/aggregate.py` (in the `FORMAT_CONVERTERS` dict):

```python
FORMAT_CONVERTERS = {
    'SemanticScholar': SemanticScholartoZoteroFormat,
    'OpenAlex': OpenAlextoZoteroFormat,
    # Add your converter
    'YourAPI': YourAPItoZoteroFormat,
}
```

## Configuration

Add to `src/api.config.yml.example`:
```yaml
# YourAPI Configuration
yourapi:
  api_key: "your-key-here"

# Rate limits
rate_limits:
  YourAPI: 2.0  # requests/second
```

Add to `src/scilex.config.yml` APIs list:

```yaml
apis:
  - SemanticScholar
  - OpenAlex
  - YourAPI  # Add here
```

## Testing

Create a test script at `src/API tests/YourAPITest.py`:

```python
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

from crawlers.collectors import YourAPI_collector
from crawlers.aggregate import YourAPItoZoteroFormat
import yaml


def test_collector():
    # Load config
    with open('scilex/api.config.yml', 'r') as f:
        config = yaml.safe_load(f)

    # Test collection
    collector = YourAPI_collector(config)
    papers = collector.run([["test"]], 10, 2024, ["title"])

    print(f"Retrieved {len(papers)} papers")

    if papers:
        # Test converter
        zotero_item = YourAPItoZoteroFormat(papers[0])
        print(f"Title: {zotero_item['title']}")


if __name__ == "__main__":
    test_collector()
```

Run:

```bash
uv run python "src/API tests/YourAPITest.py"
```

For unit tests with fixtures, create `tests/test_yourapi_collector.py`:

```python
from unittest.mock import MagicMock


def test_parse_page_results():
    import sys
    sys.path.insert(0, 'src')
    from crawlers.collectors import YourAPI_collector

    collector = YourAPI_collector()
    mock_response = MagicMock()
    mock_response.json.return_value = {
        "total": 1,
        "items": [{"title": "Test Paper", "year": 2024}]
    }

    result = collector.parsePageResults(mock_response, 1)
    assert result["total"] == 1
    assert len(result["results"]) == 1
```

Run all tests:

```bash
uv run python -m pytest tests/
```

## Key Points

### Rate Limiting

The base `API_collector` class provides `load_rate_limit_from_config()`. Always call it at the end of `__init__`, after setting `self.api_name`:

```python
def __init__(self, config=None):
    super().__init__()
    self.api_name = "YourAPI"
    # ... other setup ...
    self.load_rate_limit_from_config(config)  # Must be last
```

### MISSING_VALUE

Always use `MISSING_VALUE` from `scilex.constants` for missing fields; never use `None` or `""`:

```python
from scilex.constants import MISSING_VALUE, is_valid

title = paper.get('title') or MISSING_VALUE

if is_valid(title):
    ...  # field is present
```

### Handling Dict vs List Responses

Some APIs return a single result as a dict instead of a list when there is only one result. Always normalise:

```python
results = data.get("results", [])
if isinstance(results, dict):
    results = [results]  # Wrap single result in a list
```

### Error Handling

```python
from time import sleep

import requests

# Fragment: runs inside the collector's page loop, hence the `break`
try:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
except requests.Timeout:
    print(f"Timeout on page {page}")
    break
except requests.HTTPError as e:
    if e.response.status_code == 429:
        print("Rate limited")
        sleep(60)
    else:
        raise
```

### Pagination Strategies

```python
# Offset-based (DBLP, ISTEX style)
def get_offset(self, page):
    return (page - 1) * self.max_by_page

# 1-based page (OpenAlex, OpenAIRE style)
def get_offset(self, page):
    return page

# 0-based page (ORKG style)
def get_offset(self, page):
    return page - 1
```

## Checklist

Before submitting:

- [ ] Collector inherits from `API_collector`
- [ ] `api_name` matches the registration key exactly
- [ ] `load_rate_limit_from_config()` called at end of `__init__`
- [ ] `get_configurated_url()` returns a template with `{}` placeholder
- [ ] `get_offset(page)` returns the correct value for this API's pagination style
- [ ] `parsePageResults()` returns `{date_search, id_collect, page, total, results}`
- [ ] Handles dict vs list in API responses (normalise to list)
- [ ] Format converter uses `MISSING_VALUE` for all missing fields
- [ ] Registered in both `api_collectors` and `FORMAT_CONVERTERS` dicts
- [ ] Config examples added to `src/api.config.yml.example`
- [ ] Test script created in `src/API tests/`
- [ ] Unit tests added in `tests/`
- [ ] Code formatted with `uvx ruff format .`

## Common Issues

### Case Sensitivity

Ensure `api_name` matches the registration key and config value exactly.

### Missing Data

Always use `MISSING_VALUE` for missing fields, never `None`.

### Rate Limits

Start with a conservative rate limit and test with small batches first.

## Next Steps

See [Architecture](architecture.md) for system design details.
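
## Appendix: Checking Pagination Offsets

Before wiring a new collector into SciLEx, it can help to check which values a given `get_offset` style would put into the URL template's `{}` placeholder. The sketch below is standalone and does not import SciLEx: the helper `page_urls`, the three free functions, and the `start` query parameter are hypothetical names chosen for illustration, and the substitution is assumed to be a plain `str.format` call, which may differ from what the base `API_collector` actually does.

```python
# Standalone sketch: nothing here comes from the SciLEx code base.
MAX_BY_PAGE = 100  # mirrors self.max_by_page in the collector example


def one_based(page):
    """1-based page numbers (OpenAlex, OpenAIRE style)."""
    return page


def zero_based(page):
    """0-based page numbers (ORKG style)."""
    return page - 1


def offset_based(page, page_size=MAX_BY_PAGE):
    """Record offsets (DBLP, ISTEX style)."""
    return (page - 1) * page_size


def page_urls(template, get_offset, pages):
    """Fill the single {} placeholder with get_offset(page) for each page."""
    return [template.format(get_offset(page)) for page in pages]


# Hypothetical template; `start` is an assumed parameter name.
template = "https://api.yourapi.com/search?pageSize=100&start={}"
urls = page_urls(template, offset_based, range(1, 4))
for url in urls:
    print(url)
```

Swapping `offset_based` for `one_based` or `zero_based` shows how the same three pages map to different placeholder values, which makes an off-by-one in `get_offset` easy to spot before any request is sent.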