# SciLEx Web Interface Complete web-based interface for SciLEx - combining a FastAPI REST backend with a Streamlit frontend for interactive paper collection and analysis. ## Features ### 🎯 Core Functionality - **Multi-Source Paper Collection**: Search 10+ academic databases simultaneously - Free APIs: SemanticScholar, OpenAlex, Arxiv, PubMed, DBLP, HAL - Paid APIs: IEEE, Elsevier, Springer - Integration: Zotero, HuggingFace - **Advanced Filtering**: - By year, source, publication type - By abstract length and content - Relevance ranking by keywords - **Configuration Management**: - Easy API key configuration through web interface - Persistent storage of settings - Support for multiple API tiers - **Results Management**: - View statistics (papers by year, source, citations) - Export in multiple formats (CSV, BibTeX, JSON) - Interactive filtering of results - Pagination and search - **Pipeline Tracking**: - Background job management - Real-time progress updates - Job history and status monitoring ## Demo video ⏯️ [![Scilex teaser video](/img/scilex_web_interface.png)](https://youtu.be/FXZtrlOJ-vU) ## Quick Start ### Installation Install required dependencies: ```bash # If not already installed uv add fastapi uvicorn streamlit pandas pyyaml # Or use the main requirements.txt pip install -e . ``` ### Running the Interface #### Option 1: Run Both API and Web Interface (Recommended) ```bash python scilex/webapi/run_interface.py ``` This will: - Start FastAPI backend on `http://localhost:8000` - Start Streamlit on `http://localhost:8501` - Automatically open the web interface in your browser #### Option 2: Run Only the API ```bash python scilex/webapi/run_interface.py --api-only ``` API will be available at `http://localhost:8000` - Interactive API docs: `http://localhost:8000/docs` #### Option 3: Run Only the Web Interface ```bash python scilex/webapi/run_interface.py --web-only ``` #### Option 4: Custom Ports ```bash python scilex/webapi/run_interface.py \ --api-port 9000 \ --web-port 8888 \ --host 0.0.0.0 ``` ## Usage Guide ### 1. Configure API Keys 1. Click the **⚙️ Configuration** section in the sidebar 2. Expand **🔑 API Keys** 3. Select an API service 4. Enter credentials: - **SemanticScholar**: API key (free tier available) - **IEEE**: API key - **Elsevier**: API key + optional institutional token - **Springer**: API key - **Zotero**: API key + User ID - **HuggingFace**: Access token 5. Click **✅ Save API Configuration** ### 2. Start a New Collection 1. Go to **🔬 New Collection** tab 2. Fill in parameters: - **Collection Name**: Unique identifier - **Years**: Select publication years - **Keywords**: Enter search terms (one per line) - **Data Sources**: Choose APIs to search - **Quality Filters**: Set minimum standards 3. Click **🚀 Start Collection Pipeline** ### 3. View and Analyze Results 1. Go to **📊 View Results** tab 2. Select a completed collection 3. View statistics: - Papers by year (bar chart) - Papers by source (bar chart) - Total papers, year range, sources, average citations 4. Browse papers with pagination 5. Click on papers to view full details ### 4. Filter and Export 1. Go to **🔍 Filter & Export** tab 2. Apply filters: - Year range - Data sources - Citation count range - Abstract length 3. Click **✅ Apply Filters** 4. Export using: - **CSV**: Download filtered results - **JSON**: Structured format for processing - **BibTeX**: For citation management (if available) ### 5. View Collection History Go to **📈 Collections History** tab to see all past collections with: - Number of papers - File size - Creation date ## API Reference The FastAPI backend provides REST endpoints for programmatic access: ### Configuration Endpoints ```http GET /api-config ``` Get current API configuration (with sensitive data masked) ```http POST /api-config Content-Type: application/json { "api_name": "SemanticScholar", "api_key": "YOUR_KEY" } ``` Update API configuration ```http GET /available-apis ``` List all available APIs with descriptions ### Collection Endpoints ```http POST /pipelines/start Content-Type: application/json { "collection_config": { "keywords": [["machine learning"], ["application"]], "years": [2023, 2024], "apis": ["SemanticScholar", "OpenAlex"], "collect_name": "ml_apps" }, "api_config": { "SemanticScholar": {"api_key": "YOUR_KEY"} } } ``` Start a new collection pipeline ```http GET /pipelines/{job_id}/status ``` Check status of a running pipeline ```http GET /pipelines ``` List all pipeline jobs ### Results Endpoints ```http GET /results/{collect_name}?limit=100&skip=0 ``` Get aggregated results with pagination ```http GET /results/{collect_name}/stats ``` Get statistics about results ```http POST /export Content-Type: application/json { "collect_name": "ml_apps", "format": "csv" } ``` Export results (csv, json, bibtex) ```http GET /collections ``` List all available collections ### Filtering Endpoints ```http POST /filter/{collect_name} Content-Type: application/json { "enable_itemtype_filter": true, "allowed_item_types": ["journalArticle", "conferencePaper"], "max_papers": 500 } ``` Apply filters to results ## Examples ### Python API Client ```python import requests import json BASE_URL = "http://localhost:8000" # 1. Configure API keys api_config = { "api_name": "SemanticScholar", "api_key": "YOUR_API_KEY" } requests.post(f"{BASE_URL}/api-config", json=api_config) # 2. Start collection pipeline_request = { "collection_config": { "keywords": [["large language model", "LLM"], ["evaluation"]], "years": [2023, 2024], "apis": ["SemanticScholar", "OpenAlex"], "collect_name": "llm_eval_2024" }, "api_config": { "SemanticScholar": {"api_key": "YOUR_KEY"} } } response = requests.post(f"{BASE_URL}/pipelines/start", json=pipeline_request) job_id = response.json()["job_id"] # 3. Monitor progress import time while True: status = requests.get(f"{BASE_URL}/pipelines/{job_id}/status").json() print(f"Progress: {status['progress']}% - {status['message']}") if status['status'] in ['completed', 'failed']: break time.sleep(5) # 4. Get results results = requests.get(f"{BASE_URL}/results/llm_eval_2024").json() print(f"Retrieved {results['total']} papers") # 5. Export to CSV export_request = { "collect_name": "llm_eval_2024", "format": "csv" } response = requests.post(f"{BASE_URL}/export", json=export_request) with open("results.csv", "wb") as f: f.write(response.content) ``` ### Bash/cURL Examples ```bash # Get available APIs curl http://localhost:8000/available-apis # Check API configuration curl http://localhost:8000/api-config # Update API key curl -X POST http://localhost:8000/api-config \ -H "Content-Type: application/json" \ -d '{"api_name":"SemanticScholar","api_key":"YOUR_KEY"}' # List collections curl http://localhost:8000/collections # Get collection statistics curl http://localhost:8000/results/my_collection/stats ``` ## Configuration ### Output Directory Configure where results are saved: 1. In web interface: Set in **Configuration** sidebar 2. Via API: Pass `output_dir` in collection config 3. Default: `{project_root}/output` ### API Modes **SemanticScholar Modes:** - `regular`: Standard search endpoint, 50-100 results per page (default) - `bulk`: Bulk search endpoint, 1000 results per page (requires approval) ### Quality Filters All supported filters: - `enable_text_filter`: Remove low-quality papers - `min_abstract_words`: Minimum abstract length (default: 50) - `max_abstract_words`: Maximum abstract length (default: 1000) - `enable_itemtype_filter`: Whitelist publication types - `allowed_item_types`: Types to allow (journalArticle, conferencePaper, etc.) - `apply_relevance_ranking`: Sort by keyword relevance - `max_papers`: Return top N papers by relevance ## Advanced Usage ### Custom Pipeline Scripts For programmatic access without the web interface: ```python import sys from pathlib import Path # Setup path PROJECT_ROOT = Path(__file__).parent.parent sys.path.insert(0, str(PROJECT_ROOT / "src")) from crawlers.collector_collection import CollectCollection # Configure main_config = { "keywords": [["machine learning"], ["application"]], "years": [2023, 2024], "apis": ["SemanticScholar", "OpenAlex"], "collect_name": "ml_apps" } api_config = { "SemanticScholar": {"api_key": "YOUR_KEY"} } # Run collector = CollectCollection(main_config, api_config) collector.create_collects_jobs() ``` See [docs/user-guides/python-scripting.md](../../docs/user-guides/python-scripting.md) for more details. ### Integration with Other Tools **Export to Zotero:** ```python from scilex.push_to_zotero import main as zotero_main zotero_main() # Requires Zotero API key configured ``` **Export to BibTeX:** ```python from scilex.export_to_bibtex import main as bibtex_main bibtex_main() # Creates .bib file ``` ## Troubleshooting ### Common Issues **Port already in use:** ```bash python scilex/webapi/run_interface.py --api-port 9000 --web-port 8888 ``` **API keys not working:** 1. Verify credentials in web interface **Configuration** tab 2. Check API website for current key format 3. Ensure API is not rate-limited 4. Look for error messages in terminal **Collection producing no results:** 1. Try simpler keywords 2. Expand year range 3. Add more data sources 4. Check data source availability 5. Verify API keys for paid services **Streamlit not opening in browser:** - Manually visit `http://localhost:8501` - Check firewall settings - Try different browser **API documentation not loading:** - Ensure API is running on correct port - Visit `http://localhost:8000/docs` (interactive) - Visit `http://localhost:8000/openapi.json` (raw schema) ## Architecture ``` SciLEx Web Interface ├── Backend (FastAPI) │ ├── /api-config - API key management │ ├── /pipelines - Job management │ ├── /results - Data retrieval │ └── /export - Output handling │ ├── Frontend (Streamlit) │ ├── New Collection - Pipeline setup │ ├── View Results - Data exploration │ ├── Filter & Export - Result refinement │ ├── Collections History - Past runs │ └── Help - Documentation │ └── Core (Python) ├── CollectCollection - Multi-API aggregation ├── aggregate_collect - Deduplication & filtering └── export functions - Format conversion ``` ## File Structure ``` scilex/webapi/ ├── __init__.py # Package initialization ├── scilex_api.py # FastAPI backend ├── web_interface.py # Streamlit frontend ├── run_interface.py # Launch script └── README.md # This file ``` ## Performance Tips 1. **Reduce API calls**: Use fewer APIs, narrow year range 2. **Parallel aggregation**: Increase `--workers` (default: 3) 3. **Skip citations**: Use `--skip-citations` flag 4. **Batch operations**: Combine multiple keywords into one search 5. **Filter early**: Apply filters during collection when possible ## Security Notes - **API Keys**: Stored in `scilex/api.config.yml` (add to `.gitignore`) - **Sensitive Data**: Masked in API responses - **CORS**: Enabled for all origins (configure in production) - **Input Validation**: All user inputs validated - **HTTPS**: Use reverse proxy (nginx) in production ## Contributing Want to improve the web interface? 1. Report bugs in GitHub issues 2. Submit feature requests 3. Create pull requests with improvements 4. Add endpoints as needed ## License Same as SciLEx main project - See `LICENSE` in project root ## Support - **Documentation**: See [docs/](../../docs/) directory - **Issues**: Report on GitHub - **Questions**: Create GitHub discussion ## Changelog ### v1.0.0 - Initial release with FastAPI backend - Full Streamlit web interface - Support for all SciLEx features - Background job management - Multi-format export