SciLEx Web Interface

A complete web-based interface for SciLEx, combining a FastAPI REST backend with a Streamlit frontend for interactive paper collection and analysis.

Features

🎯 Core Functionality

Multi-Source Paper Collection: Search 10+ academic databases simultaneously
- Free APIs: SemanticScholar, OpenAlex, Arxiv, PubMed, DBLP, HAL
- Paid APIs: IEEE, Elsevier, Springer
- Integration: Zotero, HuggingFace
Advanced Filtering:
- By year, source, publication type
- By abstract length and content
- Relevance ranking by keywords
Configuration Management:
- Easy API key configuration through web interface
- Persistent storage of settings
- Support for multiple API tiers
Results Management:
- View statistics (papers by year, source, citations)
- Export in multiple formats (CSV, BibTeX, JSON)
- Interactive filtering of results
- Pagination and search
Pipeline Tracking:
- Background job management
- Real-time progress updates
- Job history and status monitoring


Quick Start

Installation

Install required dependencies:

# If not already installed
uv add fastapi uvicorn streamlit pandas pyyaml

# Or install the project and its declared dependencies in editable mode
pip install -e .

Running the Interface

Option 1: Run Both API and Web Interface (Recommended)

python scilex/webapi/run_interface.py

This will:

- Start FastAPI backend on http://localhost:8000
- Start Streamlit on http://localhost:8501
- Automatically open the web interface in your browser

Option 2: Run Only the API

python scilex/webapi/run_interface.py --api-only

API will be available at http://localhost:8000

Interactive API docs: http://localhost:8000/docs

Option 3: Run Only the Web Interface

python scilex/webapi/run_interface.py --web-only

Option 4: Custom Ports

python scilex/webapi/run_interface.py \
  --api-port 9000 \
  --web-port 8888 \
  --host 0.0.0.0
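When the interface is started from another script (for example a test harness), the flags above can be assembled programmatically. The following is a minimal sketch, not part of SciLEx: `build_launch_command` is a hypothetical helper, and it only uses the flags documented in this section. Whether the launcher accepts every combination (for example `--api-only` together with `--web-port`) is not confirmed here.

```python
import sys
from typing import List, Optional


def build_launch_command(
    api_port: Optional[int] = None,
    web_port: Optional[int] = None,
    host: Optional[str] = None,
    mode: str = "both",
) -> List[str]:
    """Build the argument list for run_interface.py (hypothetical helper).

    mode is "both", "api-only", or "web-only", mirroring Options 1 to 3.
    """
    if mode not in {"both", "api-only", "web-only"}:
        raise ValueError(f"unknown mode: {mode!r}")

    cmd = [sys.executable, "scilex/webapi/run_interface.py"]
    if mode != "both":
        cmd.append(f"--{mode}")  # becomes --api-only or --web-only
    if api_port is not None:
        cmd += ["--api-port", str(api_port)]
    if web_port is not None:
        cmd += ["--web-port", str(web_port)]
    if host is not None:
        cmd += ["--host", host]
    return cmd


# Equivalent of the "Custom Ports" command
print(build_launch_command(api_port=9000, web_port=8888, host="0.0.0.0"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the project root.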

Usage Guide

1. Configure API Keys

Click the ⚙️ Configuration section in the sidebar
Expand 🔑 API Keys
Select an API service
Enter credentials:
- SemanticScholar: API key (free tier available)
- IEEE: API key
- Elsevier: API key + optional institutional token
- Springer: API key
- Zotero: API key + User ID
- HuggingFace: Access token
Click ✅ Save API Configuration

2. Start a New Collection

Go to 🔬 New Collection tab
Fill in parameters:
- Collection Name: Unique identifier
- Years: Select publication years
- Keywords: Enter search terms (one per line)
- Data Sources: Choose APIs to search
- Quality Filters: Set minimum standards
Click 🚀 Start Collection Pipeline

3. View and Analyze Results

Go to 📊 View Results tab
Select a completed collection
View statistics:
- Papers by year (bar chart)
- Papers by source (bar chart)
- Total papers, year range, sources, average citations
Browse papers with pagination
Click on papers to view full details

4. Filter and Export

Go to 🔍 Filter & Export tab
Apply filters:
- Year range
- Data sources
- Citation count range
- Abstract length
Click ✅ Apply Filters
Export using:
- CSV: Download filtered results
- JSON: Structured format for processing
- BibTeX: For citation management (if available)

5. View Collection History

Go to 📈 Collections History tab to see all past collections with:

- Number of papers
- File size
- Creation date

API Reference

The FastAPI backend provides REST endpoints for programmatic access:

Configuration Endpoints

GET /api-config

Get current API configuration (with sensitive data masked)

POST /api-config
Content-Type: application/json

{
  "api_name": "SemanticScholar",
  "api_key": "YOUR_KEY"
}

Update API configuration

GET /available-apis

List all available APIs with descriptions

Collection Endpoints

POST /pipelines/start
Content-Type: application/json

{
  "collection_config": {
    "keywords": [["machine learning"], ["application"]],
    "years": [2023, 2024],
    "apis": ["SemanticScholar", "OpenAlex"],
    "collect_name": "ml_apps"
  },
  "api_config": {
    "SemanticScholar": {"api_key": "YOUR_KEY"}
  }
}

Start a new collection pipeline
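The `keywords` field is a list of keyword groups. This README does not spell out how the groups are combined, so the sketch below rests on an assumption: that each query pairs one term from every group (OR within a group, AND across groups). `expand_keyword_groups` is a hypothetical helper for previewing those combinations, not a SciLEx function.

```python
from itertools import product
from typing import List, Tuple


def expand_keyword_groups(groups: List[List[str]]) -> List[Tuple[str, ...]]:
    """Return one tuple per combination, taking one term from each group.

    Assumption: groups are AND-ed together and terms within a group are
    alternatives. Empty groups are ignored.
    """
    non_empty = [group for group in groups if group]
    if not non_empty:
        return []
    return list(product(*non_empty))


combos = expand_keyword_groups([["large language model", "LLM"], ["evaluation"]])
for combo in combos:
    print(" AND ".join(combo))
```

Under that assumption, two terms in the first group and one in the second yield two combinations, which gives a rough idea of how many searches a configuration implies per API and year.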

GET /pipelines/{job_id}/status

Check status of a running pipeline

GET /pipelines

List all pipeline jobs

Results Endpoints

GET /results/{collect_name}?limit=100&skip=0

Get aggregated results with pagination
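To walk through a large collection, the `limit` and `skip` query parameters can be stepped page by page. A minimal sketch, assuming `skip` is an offset counted in papers and that the total comes from the `total` field of a first response; `page_urls` is a hypothetical helper.

```python
from typing import Iterator
from urllib.parse import quote, urlencode


def page_urls(base_url: str, collect_name: str, total: int, limit: int = 100) -> Iterator[str]:
    """Yield one /results URL per page of `limit` papers (hypothetical helper)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for skip in range(0, total, limit):
        query = urlencode({"limit": limit, "skip": skip})
        # quote() keeps unusual collection names safe inside the URL path
        yield f"{base_url}/results/{quote(collect_name, safe='')}?{query}"


urls = list(page_urls("http://localhost:8000", "ml_apps", total=250, limit=100))
for url in urls:
    print(url)
```

Each URL can then be fetched with any HTTP client and the pages concatenated.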

GET /results/{collect_name}/stats

Get statistics about results
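For readers who want to reproduce similar statistics on exported data, the sketch below computes papers by year, papers by source, year range, and average citations. It is an illustration only: the record field names (`year`, `source`, `citations`) are assumptions, and the actual response schema of the stats endpoint may differ.

```python
from collections import Counter
from statistics import mean
from typing import Any, Dict, List


def summarize(papers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize paper records; field names are assumed, not the SciLEx schema."""
    years = [p["year"] for p in papers if p.get("year") is not None]
    citations = [p["citations"] for p in papers if p.get("citations") is not None]
    return {
        "total_papers": len(papers),
        "by_year": dict(Counter(years)),
        "by_source": dict(Counter(p.get("source", "unknown") for p in papers)),
        "year_range": (min(years), max(years)) if years else None,
        "avg_citations": mean(citations) if citations else None,
    }


sample = [
    {"year": 2023, "source": "OpenAlex", "citations": 10},
    {"year": 2024, "source": "OpenAlex", "citations": 20},
    {"year": 2024, "source": "SemanticScholar", "citations": 3},
]
stats = summarize(sample)
print(stats)
```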

POST /export
Content-Type: application/json

{
  "collect_name": "ml_apps",
  "format": "csv"
}

Export results (csv, json, bibtex)

GET /collections

List all available collections

Filtering Endpoints

POST /filter/{collect_name}
Content-Type: application/json

{
  "enable_itemtype_filter": true,
  "allowed_item_types": ["journalArticle", "conferencePaper"],
  "max_papers": 500
}

Apply filters to results

Examples

Python API Client

import requests

BASE_URL = "http://localhost:8000"

# 1. Configure API keys
api_config = {
    "api_name": "SemanticScholar",
    "api_key": "YOUR_API_KEY"
}
requests.post(f"{BASE_URL}/api-config", json=api_config)

# 2. Start collection
pipeline_request = {
    "collection_config": {
        "keywords": [["large language model", "LLM"], ["evaluation"]],
        "years": [2023, 2024],
        "apis": ["SemanticScholar", "OpenAlex"],
        "collect_name": "llm_eval_2024"
    },
    "api_config": {
        "SemanticScholar": {"api_key": "YOUR_KEY"}
    }
}
response = requests.post(f"{BASE_URL}/pipelines/start", json=pipeline_request)
job_id = response.json()["job_id"]

# 3. Monitor progress
import time
while True:
    status = requests.get(f"{BASE_URL}/pipelines/{job_id}/status").json()
    print(f"Progress: {status['progress']}% - {status['message']}")
    if status['status'] in ['completed', 'failed']:
        break
    time.sleep(5)

# 4. Get results
results = requests.get(f"{BASE_URL}/results/llm_eval_2024").json()
print(f"Retrieved {results['total']} papers")

# 5. Export to CSV
export_request = {
    "collect_name": "llm_eval_2024",
    "format": "csv"
}
response = requests.post(f"{BASE_URL}/export", json=export_request)
with open("results.csv", "wb") as f:
    f.write(response.content)
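The monitoring loop in step 3 runs until the job reports `completed` or `failed`, so a stalled job would block the script indefinitely. A minimal sketch of the same polling pattern with a timeout follows. `wait_for_job` is a hypothetical helper; the status fetch is passed in as a callable so the logic does not depend on a particular HTTP client.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_status: Callable[[], Dict[str, Any]],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job is completed or failed, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status["status"] in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("pipeline did not finish before the timeout")
        sleep(poll_seconds)
```

With the client above, this could be called as `wait_for_job(lambda: requests.get(f"{BASE_URL}/pipelines/{job_id}/status").json())`.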

Bash/cURL Examples

# Get available APIs
curl http://localhost:8000/available-apis

# Check API configuration
curl http://localhost:8000/api-config

# Update API key
curl -X POST http://localhost:8000/api-config \
  -H "Content-Type: application/json" \
  -d '{"api_name":"SemanticScholar","api_key":"YOUR_KEY"}'

# List collections
curl http://localhost:8000/collections

# Get collection statistics
curl http://localhost:8000/results/my_collection/stats

Configuration

Output Directory

Configure where results are saved:

- In web interface: Set in Configuration sidebar
- Via API: Pass output_dir in collection config
- Default: {project_root}/output

API Modes

SemanticScholar Modes:

- regular: Standard search endpoint, 50-100 results per page (default)
- bulk: Bulk search endpoint, 1000 results per page (requires approval)

Quality Filters

All supported filters:

- enable_text_filter: Remove low-quality papers
- min_abstract_words: Minimum abstract length (default: 50)
- max_abstract_words: Maximum abstract length (default: 1000)
- enable_itemtype_filter: Whitelist publication types
- allowed_item_types: Types to allow (journalArticle, conferencePaper, etc.)
- apply_relevance_ranking: Sort by keyword relevance
- max_papers: Return top N papers by relevance
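The filters listed above can be collected into a small typed object before being sent as a JSON body to the filtering endpoint. This is a sketch under stated assumptions: `QualityFilters` is a hypothetical class, the abstract-length defaults come from the list above, and the boolean defaults and the default `allowed_item_types` are illustrative guesses rather than documented SciLEx behavior.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class QualityFilters:
    """Filter settings keyed by the documented option names (hypothetical class)."""

    enable_text_filter: bool = False  # assumed default
    min_abstract_words: int = 50  # documented default
    max_abstract_words: int = 1000  # documented default
    enable_itemtype_filter: bool = False  # assumed default
    allowed_item_types: List[str] = field(
        default_factory=lambda: ["journalArticle", "conferencePaper"]
    )
    apply_relevance_ranking: bool = False  # assumed default
    max_papers: Optional[int] = None  # None means "no cap" in this sketch

    def to_payload(self) -> Dict[str, Any]:
        """Validate the settings and return a JSON-serializable dict."""
        if self.min_abstract_words > self.max_abstract_words:
            raise ValueError("min_abstract_words exceeds max_abstract_words")
        if self.max_papers is not None and self.max_papers <= 0:
            raise ValueError("max_papers must be positive")
        payload = asdict(self)
        if payload["max_papers"] is None:
            del payload["max_papers"]  # omit the cap instead of sending null
        return payload


payload = QualityFilters(enable_itemtype_filter=True, max_papers=500).to_payload()
print(payload)
```

Validating locally catches inverted ranges before a request is made.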

Advanced Usage

Custom Pipeline Scripts

For programmatic access without the web interface:

import sys
from pathlib import Path

# Setup path
PROJECT_ROOT = Path(__file__).parent.parent
sys.path.insert(0, str(PROJECT_ROOT / "src"))

from crawlers.collector_collection import CollectCollection

# Configure
main_config = {
    "keywords": [["machine learning"], ["application"]],
    "years": [2023, 2024],
    "apis": ["SemanticScholar", "OpenAlex"],
    "collect_name": "ml_apps"
}

api_config = {
    "SemanticScholar": {"api_key": "YOUR_KEY"}
}

# Run
collector = CollectCollection(main_config, api_config)
collector.create_collects_jobs()

See docs/user-guides/python-scripting.md for more details.

Integration with Other Tools

Export to Zotero:

from scilex.push_to_zotero import main as zotero_main
zotero_main()  # Requires Zotero API key configured

Export to BibTeX:

from scilex.export_to_bibtex import main as bibtex_main
bibtex_main()  # Creates .bib file

Troubleshooting

Common Issues

Port already in use:

python scilex/webapi/run_interface.py --api-port 9000 --web-port 8888
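Before choosing replacement ports, it can help to check which ones are actually free. The following is a minimal standard-library sketch; `is_port_free` and `first_free_port` are hypothetical helpers, not part of SciLEx. Note that a port reported as free can still be taken by another process before the interface starts.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int, host: str = "127.0.0.1", attempts: int = 50) -> int:
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if is_port_free(port, host):
            return port
    raise RuntimeError(f"no free port found from {start} to {start + attempts - 1}")


if __name__ == "__main__":
    print("API port candidate:", first_free_port(8000))
    print("Web port candidate:", first_free_port(8501))
```

The reported values can then be passed to `--api-port` and `--web-port`.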

API keys not working:

- Verify credentials in web interface Configuration tab
- Check API website for current key format
- Ensure API is not rate-limited
- Look for error messages in terminal

Collection producing no results:

- Try simpler keywords
- Expand year range
- Add more data sources
- Check data source availability
- Verify API keys for paid services

Streamlit not opening in browser:

- Manually visit http://localhost:8501
- Check firewall settings
- Try different browser

API documentation not loading:

- Ensure API is running on correct port
- Visit http://localhost:8000/docs (interactive)
- Visit http://localhost:8000/openapi.json (raw schema)
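If the interactive page does not render, the raw schema can still be used to see which endpoints the running backend exposes. The sketch below lists method and path pairs from an OpenAPI document, relying only on the standard `paths` object. `list_endpoints` is a hypothetical helper, and the sample schema is a hand-written stand-in for the real response.

```python
from typing import Any, Dict, List

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def list_endpoints(schema: Dict[str, Any]) -> List[str]:
    """Return "METHOD /path" strings from an OpenAPI schema's paths object."""
    endpoints = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Path items may also hold non-method keys such as "parameters"
            if method.lower() in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return endpoints


# Hand-written sample; the real schema comes from /openapi.json
sample_schema = {
    "openapi": "3.1.0",
    "paths": {
        "/api-config": {"get": {}, "post": {}},
        "/pipelines/{job_id}/status": {"get": {}},
    },
}
endpoints = list_endpoints(sample_schema)
print(endpoints)
```

Against a running backend, the schema can be loaded with `json.load(urllib.request.urlopen("http://localhost:8000/openapi.json"))`.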

Architecture

SciLEx Web Interface
├── Backend (FastAPI)
│   ├── /api-config - API key management
│   ├── /pipelines - Job management
│   ├── /results - Data retrieval
│   └── /export - Output handling
│
├── Frontend (Streamlit)
│   ├── New Collection - Pipeline setup
│   ├── View Results - Data exploration
│   ├── Filter & Export - Result refinement
│   ├── Collections History - Past runs
│   └── Help - Documentation
│
└── Core (Python)
    ├── CollectCollection - Multi-API aggregation
    ├── aggregate_collect - Deduplication & filtering
    └── export functions - Format conversion

File Structure

scilex/webapi/
├── __init__.py                  # Package initialization
├── scilex_api.py               # FastAPI backend
├── web_interface.py            # Streamlit frontend
├── run_interface.py            # Launch script
└── README.md                   # This file

Performance Tips

- Reduce API calls: Use fewer APIs, narrow year range
- Parallel aggregation: Increase --workers (default: 3)
- Skip citations: Use --skip-citations flag
- Batch operations: Combine multiple keywords into one search
- Filter early: Apply filters during collection when possible

Security Notes

- API Keys: Stored in scilex/api.config.yml (add this file to .gitignore)
- Sensitive Data: Masked in API responses
- CORS: Enabled for all origins by default (restrict to trusted origins in production)
- Input Validation: All user inputs validated
- HTTPS: Use a reverse proxy (e.g., nginx) in production
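As an illustration of what masking sensitive data can look like, the sketch below hides all but the last few characters of each key in a configuration dict shaped like the `api_config` examples in this README. It is not the backend's actual implementation: `mask_secret` and `mask_config` are hypothetical helpers, and the real masking format may differ.

```python
from typing import Any, Dict, Iterable


def mask_secret(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if visible <= 0 or len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def mask_config(
    config: Dict[str, Dict[str, Any]],
    secret_fields: Iterable[str] = ("api_key",),
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of the config with secret fields masked."""
    secrets = set(secret_fields)
    masked: Dict[str, Dict[str, Any]] = {}
    for service, settings in config.items():
        masked[service] = {
            key: mask_secret(str(val)) if key in secrets and val else val
            for key, val in settings.items()
        }
    return masked


masked = mask_config({"SemanticScholar": {"api_key": "abcd1234efgh"}})
print(masked)
```

Other credential fields, such as tokens or user IDs, can be added through `secret_fields`.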

Contributing

Want to improve the web interface?

- Report bugs in GitHub issues
- Submit feature requests
- Create pull requests with improvements
- Add endpoints as needed

License

Same as SciLEx main project - See LICENSE in project root

Support

- Documentation: See docs/ directory
- Issues: Report on GitHub
- Questions: Create GitHub discussion

Changelog

v1.0.0

- Initial release with FastAPI backend
- Full Streamlit web interface
- Support for all SciLEx features
- Background job management
- Multi-format export