SciLEx Web Interface

A complete web interface for SciLEx that combines a FastAPI REST backend with a Streamlit frontend for interactive paper collection and analysis.

Features

🎯 Core Functionality

  • Multi-Source Paper Collection: Search 10+ academic databases simultaneously

    • Free APIs: SemanticScholar, OpenAlex, Arxiv, PubMed, DBLP, HAL

    • Paid APIs: IEEE, Elsevier, Springer

    • Integration: Zotero, HuggingFace

  • Advanced Filtering:

    • By year, source, publication type

    • By abstract length and content

    • Relevance ranking by keywords

  • Configuration Management:

    • Easy API key configuration through web interface

    • Persistent storage of settings

    • Support for multiple API tiers

  • Results Management:

    • View statistics (papers by year, source, citations)

    • Export in multiple formats (CSV, BibTeX, JSON)

    • Interactive filtering of results

    • Pagination and search

  • Pipeline Tracking:

    • Background job management

    • Real-time progress updates

    • Job history and status monitoring

Demo video ⏯️

SciLEx teaser video

Quick Start

Installation

Install required dependencies:

# If not already installed
uv add fastapi uvicorn streamlit pandas pyyaml

# Or install SciLEx in editable mode, which pulls in the project's declared dependencies
pip install -e .

Running the Interface

Option 1: Run the API and the Web Interface Together

python scilex/webapi/run_interface.py

Assuming the launcher starts both services when no mode flag is given, the API will be available at http://localhost:8000 and the web interface at http://localhost:8501.

Option 2: Run Only the API

python scilex/webapi/run_interface.py --api-only

API will be available at http://localhost:8000

  • Interactive API docs: http://localhost:8000/docs

Option 3: Run Only the Web Interface

python scilex/webapi/run_interface.py --web-only

Web interface will be available at http://localhost:8501

Option 4: Custom Ports

python scilex/webapi/run_interface.py \
  --api-port 9000 \
  --web-port 8888 \
  --host 0.0.0.0

Usage Guide

1. Configure API Keys

  1. Click the ⚙️ Configuration section in the sidebar

  2. Expand 🔑 API Keys

  3. Select an API service

  4. Enter credentials:

    • SemanticScholar: API key (free tier available)

    • IEEE: API key

    • Elsevier: API key + optional institutional token

    • Springer: API key

    • Zotero: API key + User ID

    • HuggingFace: Access token

  5. Click ✅ Save API Configuration

2. Start a New Collection

  1. Go to 🔬 New Collection tab

  2. Fill in parameters:

    • Collection Name: Unique identifier

    • Years: Select publication years

    • Keywords: Enter search terms (one per line)

    • Data Sources: Choose APIs to search

    • Quality Filters: Set minimum standards

  3. Click 🚀 Start Collection Pipeline

3. View and Analyze Results

  1. Go to 📊 View Results tab

  2. Select a completed collection

  3. View statistics:

    • Papers by year (bar chart)

    • Papers by source (bar chart)

    • Total papers, year range, sources, average citations

  4. Browse papers with pagination

  5. Click on papers to view full details

4. Filter and Export

  1. Go to 🔍 Filter & Export tab

  2. Apply filters:

    • Year range

    • Data sources

    • Citation count range

    • Abstract length

  3. Click ✅ Apply Filters

  4. Export using:

    • CSV: Download filtered results

    • JSON: Structured format for processing

    • BibTeX: For citation management (if available)

5. View Collection History

Go to 📈 Collections History tab to see all past collections with:

  • Number of papers

  • File size

  • Creation date

API Reference

The FastAPI backend provides REST endpoints for programmatic access:

Configuration Endpoints

GET /api-config

Get current API configuration (with sensitive data masked)
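
To illustrate what masking can look like, the sketch below hides all but the last four characters of each credential before a configuration is returned. The function names and the keep-last-four policy are assumptions made for this example; the backend's actual masking rule may differ.

```python
def mask_secret(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        # Too short to reveal anything safely: mask everything.
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def mask_config(config: dict) -> dict:
    """Return a copy of a {api_name: {field: value}} mapping with string values masked."""
    return {
        api: {k: mask_secret(v) if isinstance(v, str) else v for k, v in fields.items()}
        for api, fields in config.items()
    }
```

Masking on the way out keeps the stored key usable by the pipeline while preventing it from leaking through the REST response.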

POST /api-config
Content-Type: application/json

{
  "api_name": "SemanticScholar",
  "api_key": "YOUR_KEY"
}

Update API configuration

GET /available-apis

List all available APIs with descriptions

Collection Endpoints

POST /pipelines/start
Content-Type: application/json

{
  "collection_config": {
    "keywords": [["machine learning"], ["application"]],
    "years": [2023, 2024],
    "apis": ["SemanticScholar", "OpenAlex"],
    "collect_name": "ml_apps"
  },
  "api_config": {
    "SemanticScholar": {"api_key": "YOUR_KEY"}
  }
}

Start a new collection pipeline
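
The request body above can be assembled and sanity-checked before it is sent. The helper below is a minimal sketch: build_pipeline_request is a hypothetical name, and the checks (non-empty name, at least one keyword, at least one API) are assumptions about what a reasonable client would verify, not rules documented by the backend.

```python
def build_pipeline_request(collect_name, keyword_groups, years, apis, api_keys=None):
    """Assemble the JSON body for POST /pipelines/start (helper name is hypothetical)."""
    if not collect_name.strip():
        raise ValueError("collect_name must not be empty")
    # Drop blank terms and empty groups so they never reach the backend.
    groups = [[t.strip() for t in g if t.strip()] for g in keyword_groups]
    groups = [g for g in groups if g]
    if not groups:
        raise ValueError("at least one keyword is required")
    if not apis:
        raise ValueError("at least one API must be selected")
    return {
        "collection_config": {
            "keywords": groups,
            "years": sorted(set(years)),
            "apis": list(apis),
            "collect_name": collect_name.strip(),
        },
        "api_config": {
            name: {"api_key": key} for name, key in (api_keys or {}).items()
        },
    }
```

The returned dictionary can be passed directly as the json argument of requests.post.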

GET /pipelines/{job_id}/status

Check status of a running pipeline
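
Because pipelines run in the background, clients typically poll this endpoint. The sketch below wraps the polling loop with a timeout so a stalled job cannot block forever. wait_for_job is a hypothetical helper; it assumes the status JSON carries a status field whose terminal values are completed and failed, as in the Python client example in this README.

```python
import time


def wait_for_job(fetch_status, interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Poll until the job reports 'completed' or 'failed', or raise on timeout.

    `fetch_status` is any zero-argument callable returning the status JSON as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("pipeline did not finish in time")
        sleep(interval)
```

With requests, fetch_status could be `lambda: requests.get(f"{BASE_URL}/pipelines/{job_id}/status", timeout=10).json()`. Passing the fetcher in as a callable keeps the loop independent of the HTTP library and easy to test.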

GET /pipelines

List all pipeline jobs

Results Endpoints

GET /results/{collect_name}?limit=100&skip=0

Get aggregated results with pagination
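
The limit and skip parameters make it possible to walk a large collection page by page. The generator below is a sketch of that loop. Since the exact shape of the response body is not described here, the page fetch is abstracted behind a hypothetical fetch_page(skip, limit) callable that returns a plain list of records.

```python
def iter_records(fetch_page, limit=100):
    """Yield records page by page until a short or empty page signals the end.

    `fetch_page(skip, limit)` must return a list of records for that window.
    """
    skip = 0
    while True:
        page = fetch_page(skip, limit)
        yield from page
        if len(page) < limit:
            break
        skip += limit
```

A page shorter than limit is treated as the last one, so a collection whose size is an exact multiple of limit costs one extra (empty) request.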

GET /results/{collect_name}/stats

Get statistics about results
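
For readers who want to reproduce these statistics from downloaded records, the sketch below computes counts by year and by source, the year range, and the average citation count. The record field names (year, source, citations) are assumptions for this example and may not match the keys the API returns.

```python
from collections import Counter


def summarize(papers):
    """Compute per-year and per-source counts plus the mean citation count."""
    by_year = Counter(p["year"] for p in papers if p.get("year") is not None)
    by_source = Counter(p["source"] for p in papers if p.get("source"))
    cited = [p["citations"] for p in papers if p.get("citations") is not None]
    return {
        "total": len(papers),
        "by_year": dict(by_year),
        "by_source": dict(by_source),
        "year_range": (min(by_year), max(by_year)) if by_year else None,
        "avg_citations": sum(cited) / len(cited) if cited else 0.0,
    }
```

Papers with no citation count are left out of the average rather than counted as zero, which avoids dragging the mean down for sources that do not report citations.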

POST /export
Content-Type: application/json

{
  "collect_name": "ml_apps",
  "format": "csv"
}

Export results (csv, json, bibtex)

GET /collections

List all available collections

Filtering Endpoints

POST /filter/{collect_name}
Content-Type: application/json

{
  "enable_itemtype_filter": true,
  "allowed_item_types": ["journalArticle", "conferencePaper"],
  "max_papers": 500
}

Apply filters to results

Examples

Python API Client

import time

import requests

BASE_URL = "http://localhost:8000"
COLLECT_NAME = "llm_eval_2024"
TIMEOUT = 30  # seconds per HTTP request

# 1. Configure API keys
api_config = {
    "api_name": "SemanticScholar",
    "api_key": "YOUR_API_KEY"
}
response = requests.post(f"{BASE_URL}/api-config", json=api_config, timeout=TIMEOUT)
response.raise_for_status()  # stop early on an HTTP error

# 2. Start collection
pipeline_request = {
    "collection_config": {
        "keywords": [["large language model", "LLM"], ["evaluation"]],
        "years": [2023, 2024],
        "apis": ["SemanticScholar", "OpenAlex"],
        "collect_name": COLLECT_NAME
    },
    "api_config": {
        "SemanticScholar": {"api_key": "YOUR_API_KEY"}
    }
}
response = requests.post(f"{BASE_URL}/pipelines/start", json=pipeline_request, timeout=TIMEOUT)
response.raise_for_status()
job_id = response.json()["job_id"]

# 3. Monitor progress until the job reaches a terminal state
while True:
    status = requests.get(f"{BASE_URL}/pipelines/{job_id}/status", timeout=TIMEOUT).json()
    print(f"Progress: {status['progress']}% - {status['message']}")
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)
if status["status"] == "failed":
    raise SystemExit(f"Pipeline failed: {status['message']}")

# 4. Get results
results = requests.get(f"{BASE_URL}/results/{COLLECT_NAME}", timeout=TIMEOUT).json()
print(f"Retrieved {results['total']} papers")

# 5. Export to CSV
export_request = {
    "collect_name": COLLECT_NAME,
    "format": "csv"
}
response = requests.post(f"{BASE_URL}/export", json=export_request, timeout=TIMEOUT)
response.raise_for_status()
with open("results.csv", "wb") as f:
    f.write(response.content)

Bash/cURL Examples

# Get available APIs
curl http://localhost:8000/available-apis

# Check API configuration
curl http://localhost:8000/api-config

# Update API key
curl -X POST http://localhost:8000/api-config \
  -H "Content-Type: application/json" \
  -d '{"api_name":"SemanticScholar","api_key":"YOUR_KEY"}'

# List collections
curl http://localhost:8000/collections

# Get collection statistics
curl http://localhost:8000/results/my_collection/stats
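
Where neither curl nor the requests package is available, the same calls can be made with the Python standard library. The sketch below builds the equivalent of the "Update API key" command above. post_json is a hypothetical helper name, and the request is only constructed here; the commented lines show how it would be sent to a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def post_json(path, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request, like `curl -X POST -d ...`."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = post_json("/api-config", {"api_name": "SemanticScholar", "api_key": "YOUR_KEY"})
# To send it against a running server:
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(json.load(resp))
```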

Configuration

Output Directory

Configure where results are saved:

  1. In web interface: Set in Configuration sidebar

  2. Via API: Pass output_dir in collection config

  3. Default: {project_root}/output
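
The fallback to the default directory can be sketched as follows. resolve_output_dir is a hypothetical helper, and it only covers the choice between an explicit output_dir in the collection config and the {project_root}/output default; how the web interface setting is merged in is not covered here.

```python
from pathlib import Path


def resolve_output_dir(project_root, collection_config=None):
    """Return the configured `output_dir`, else `<project_root>/output`."""
    configured = (collection_config or {}).get("output_dir")
    if configured:
        # An explicit setting wins; expand "~" so home-relative paths work.
        return Path(configured).expanduser()
    return Path(project_root) / "output"
```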

API Modes

SemanticScholar Modes:

  • regular: Standard search endpoint, 50-100 results per page (default)

  • bulk: Bulk search endpoint, 1000 results per page (requires approval)

Quality Filters

All supported filters:

  • enable_text_filter: Remove low-quality papers

  • min_abstract_words: Minimum abstract length (default: 50)

  • max_abstract_words: Maximum abstract length (default: 1000)

  • enable_itemtype_filter: Whitelist publication types

  • allowed_item_types: Types to allow (journalArticle, conferencePaper, etc.)

  • apply_relevance_ranking: Sort by keyword relevance

  • max_papers: Return top N papers by relevance
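
To make the interaction of these options concrete, the sketch below applies them to a list of paper dictionaries on the client side. It is an approximation under stated assumptions, not the pipeline's implementation: the text filter is reduced to the abstract word-count bounds, the record keys abstract and itemType are assumed, and the input is assumed to be already sorted by relevance so that max_papers is a simple truncation.

```python
def apply_quality_filters(papers, *, enable_text_filter=True, min_abstract_words=50,
                          max_abstract_words=1000, enable_itemtype_filter=False,
                          allowed_item_types=(), max_papers=None):
    """Filter a list of paper dicts with the options listed above."""
    kept = []
    for paper in papers:
        if enable_text_filter:
            # A missing abstract counts as zero words.
            n_words = len((paper.get("abstract") or "").split())
            if not min_abstract_words <= n_words <= max_abstract_words:
                continue
        if enable_itemtype_filter and paper.get("itemType") not in allowed_item_types:
            continue
        kept.append(paper)
    # Papers are assumed to arrive already ranked, so truncation keeps the top N.
    return kept if max_papers is None else kept[:max_papers]
```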

Advanced Usage

Custom Pipeline Scripts

For programmatic access without the web interface:

import sys
from pathlib import Path

# Setup path
PROJECT_ROOT = Path(__file__).parent.parent
sys.path.insert(0, str(PROJECT_ROOT / "src"))

from crawlers.collector_collection import CollectCollection

# Configure
main_config = {
    "keywords": [["machine learning"], ["application"]],
    "years": [2023, 2024],
    "apis": ["SemanticScholar", "OpenAlex"],
    "collect_name": "ml_apps"
}

api_config = {
    "SemanticScholar": {"api_key": "YOUR_KEY"}
}

# Run
collector = CollectCollection(main_config, api_config)
collector.create_collects_jobs()

See docs/user-guides/python-scripting.md for more details.

Integration with Other Tools

Export to Zotero:

from scilex.push_to_zotero import main as zotero_main
zotero_main()  # Requires Zotero API key configured

Export to BibTeX:

from scilex.export_to_bibtex import main as bibtex_main
bibtex_main()  # Creates .bib file

Troubleshooting

Common Issues

Port already in use:

python scilex/webapi/run_interface.py --api-port 9000 --web-port 8888
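
Before choosing replacement ports, it can help to check which ones are actually taken. The sketch below uses only the standard library. port_in_use and first_free_port are hypothetical helper names, and the check tests for a listener on the loopback address only.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


def first_free_port(start, host="127.0.0.1", attempts=50):
    """Return the first port at or above `start` with no listener."""
    for port in range(start, start + attempts):
        if not port_in_use(port, host):
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")
```

For example, first_free_port(8000) and first_free_port(8501) give candidate values for --api-port and --web-port.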

API keys not working:

  1. Verify credentials in the Configuration section of the web interface sidebar

  2. Check API website for current key format

  3. Ensure API is not rate-limited

  4. Look for error messages in terminal

Collection producing no results:

  1. Try simpler keywords

  2. Expand year range

  3. Add more data sources

  4. Check data source availability

  5. Verify API keys for paid services

Streamlit not opening in browser:

  • Manually visit http://localhost:8501

  • Check firewall settings

  • Try different browser

API documentation not loading:

  • Ensure API is running on correct port

  • Visit http://localhost:8000/docs (interactive)

  • Visit http://localhost:8000/openapi.json (raw schema)

Architecture

SciLEx Web Interface
├── Backend (FastAPI)
│   ├── /api-config - API key management
│   ├── /pipelines - Job management
│   ├── /results - Data retrieval
│   └── /export - Output handling
│
├── Frontend (Streamlit)
│   ├── New Collection - Pipeline setup
│   ├── View Results - Data exploration
│   ├── Filter & Export - Result refinement
│   ├── Collections History - Past runs
│   └── Help - Documentation
│
└── Core (Python)
    ├── CollectCollection - Multi-API aggregation
    ├── aggregate_collect - Deduplication & filtering
    └── export functions - Format conversion

File Structure

scilex/webapi/
├── __init__.py                  # Package initialization
├── scilex_api.py               # FastAPI backend
├── web_interface.py            # Streamlit frontend
├── run_interface.py            # Launch script
└── README.md                   # This file

Performance Tips

  1. Reduce API calls: Use fewer APIs, narrow year range

  2. Parallel aggregation: Increase --workers (default: 3)

  3. Skip citations: Use --skip-citations flag

  4. Batch operations: Combine multiple keywords into one search

  5. Filter early: Apply filters during collection when possible

Security Notes

  • API Keys: Stored in scilex/api.config.yml (add to .gitignore)

  • Sensitive Data: Masked in API responses

  • CORS: Enabled for all origins by default; restrict to trusted origins in production

  • Input Validation: All user inputs are validated

  • HTTPS: Terminate TLS at a reverse proxy (such as nginx) in production

Contributing

Want to improve the web interface?

  1. Report bugs in GitHub issues

  2. Submit feature requests

  3. Create pull requests with improvements

  4. Add endpoints as needed

License

Same as the main SciLEx project. See LICENSE in the project root.

Support

  • Documentation: See docs/ directory

  • Issues: Report on GitHub

  • Questions: Create GitHub discussion

Changelog

v1.0.0

  • Initial release with FastAPI backend

  • Full Streamlit web interface

  • Support for all SciLEx features

  • Background job management

  • Multi-format export