# Basic Workflow Guide

Standard workflow for collecting and aggregating papers.

## Workflow Steps

1. **Collection** - Query APIs and download metadata
2. **Aggregation** - Deduplicate and filter
3. **Export** - Push to Zotero (optional)

## Step 1: Collection

### Configure Search

Edit `src/scilex.config.yml`:

```yaml
keywords:
  - ["machine learning"]
  - []
years: [2023, 2024]
apis:
  - SemanticScholar
  - OpenAlex
fields: ["title", "abstract"]
```

### Run Collection

```bash
uv run python src/run_collecte.py
```

Results are saved to `output/collect_YYYYMMDD_HHMMSS/`.

Output structure:

```
output/collect_20241113_143022/
├── config_used.yml
├── SemanticScholar/
│   ├── 0/              # Query 0: keyword[0] + year[0]
│   │   ├── page_1
│   │   └── page_2
├── OpenAlex/
```

### Idempotent Behavior

Re-running collection skips queries that have already completed, so it is safe to re-run without wasting API quota.

## Step 2: Aggregation

### Basic Aggregation

```bash
uv run python src/aggregate_collect.py
```

Process:

1. Loads JSON files
2. Converts to unified format
3. Deduplicates (DOI, URL, fuzzy title)
4. Applies keyword filtering
5. Scores quality
6. Saves to CSV

### With Citations

Enable in `src/scilex.config.yml`:

```yaml
aggregate_get_citations: true
```

Then run aggregation. Citations are fetched from cache → Semantic Scholar → OpenCitations.

### Output

The CSV is saved to `output/collect_*/aggregated_data.csv`.

Columns:

- `title`, `authors`, `year`, `DOI`, `abstract`
- `itemType` - Publication type
- `publicationTitle` - Journal/conference
- `citation_count` - Citations (if enabled)
- `quality_score` - Metadata completeness (0-100)
- `relevance_score` - Relevance (0-10)

## Step 3: Export to Zotero

### Configure

Edit `scilex/api.config.yml`:

```yaml
zotero:
  api_key: "your-key"
  user_mode: "user"  # or "group"
```

### Run Export

```bash
uv run python src/push_to_Zotero_collect.py
```

Papers are uploaded in batches. Duplicates are skipped by URL.

## Filtering Pipeline

Aggregation applies these filters:

1. **ItemType** - Keep allowed publication types
2. **Keywords** - Match search terms
3. **Deduplication** - Remove duplicates
4. **Quality** - Remove low-quality metadata
5. **Citations** - Time-aware thresholds
6. **Relevance** - Score and limit to top N

Check the logs to see how many papers are filtered at each step.

## Complete Example

```yaml
# src/scilex.config.yml
keywords:
  - ["knowledge graph"]
  - ["LLM", "large language model"]
years: [2023, 2024]
apis:
  - SemanticScholar
  - OpenAlex
aggregate_get_citations: true
quality_filters:
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper
  apply_relevance_ranking: true
  max_papers: 300
```

Run:

```bash
uv run python src/run_collecte.py
uv run python src/aggregate_collect.py
uv run python src/push_to_Zotero_collect.py
```

## Analyze Results

```python
import glob

import pandas as pd

# read_csv does not expand wildcards, so pick the most recent collection explicitly
csv_path = sorted(glob.glob('output/collect_*/aggregated_data.csv'))[-1]
df = pd.read_csv(csv_path, delimiter=';')

# The citation column is listed above as `citation_count`; fall back to
# `nb_citation` if that is the name the CSV actually uses
cite_col = 'citation_count' if 'citation_count' in df.columns else 'nb_citation'

print(f"Total papers: {len(df)}")
print("\nPapers by year:")
print(df['year'].value_counts().sort_index())
print("\nTop cited:")
print(df.nlargest(10, cite_col)[['title', cite_col]])
```

## Log Levels

```bash
# Default (clean output)
uv run python src/run_collecte.py

# Detailed progress
LOG_LEVEL=INFO uv run python src/run_collecte.py

# Full debugging
LOG_LEVEL=DEBUG uv run python src/run_collecte.py
```

## Next Steps

- [Advanced Filtering](advanced-filtering.md) - Filtering options
- [Configuration](../getting-started/configuration.md) - All config parameters
- [API Comparison](../reference/api-comparison.md) - API details
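
## Example: Deduplication Sketch

The deduplication stage in Step 2 matches records on DOI, URL, and fuzzy title. The sketch below illustrates that idea using only the standard library. It is not the project's actual implementation: the function names `normalize_title` and `deduplicate`, the `url` key, and the `0.95` similarity threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase a title and replace punctuation with single spaces."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def deduplicate(papers, title_threshold=0.95):
    """Keep the first record seen for each DOI, URL, or near-identical title.

    `papers` is a list of dicts with optional "DOI", "url", and "title" keys
    (hypothetical field names chosen for this sketch).
    """
    seen_dois, seen_urls = set(), set()
    kept_titles, kept = [], []
    for paper in papers:
        doi = (paper.get("DOI") or "").strip().lower()
        url = (paper.get("url") or "").strip().lower()
        title = normalize_title(paper.get("title") or "")

        # Exact identifier matches are the cheapest checks, so run them first
        if doi and doi in seen_dois:
            continue
        if url and url in seen_urls:
            continue
        # Fuzzy title match: similarity ratio against every title kept so far
        if title and any(
            SequenceMatcher(None, title, other).ratio() >= title_threshold
            for other in kept_titles
        ):
            continue

        if doi:
            seen_dois.add(doi)
        if url:
            seen_urls.add(url)
        if title:
            kept_titles.append(title)
        kept.append(paper)
    return kept


papers = [
    {"title": "Knowledge Graphs and LLMs", "DOI": "10.1000/abc", "url": "https://example.org/a"},
    {"title": "Knowledge graphs and LLMs.", "DOI": None, "url": "https://example.org/b"},
    {"title": "A different paper", "DOI": "10.1000/ABC", "url": None},
    {"title": "Graph neural networks", "DOI": "10.1000/xyz", "url": "https://example.org/c"},
]
unique = deduplicate(papers)
print([p["title"] for p in unique])
```

In this sample, the second record is dropped as a fuzzy title match and the third as a case-insensitive DOI match. The pairwise title comparison is quadratic in the number of kept papers, which is acceptable for a few thousand records but would need blocking or indexing for larger collections.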