Advanced Filtering Guide

SciLEx applies a 5-phase filtering pipeline to refine paper collections.

Filtering Pipeline

  1. ItemType Filter - Keep specific publication types

  2. Keyword Match - Verify search term relevance

  3. Quality Score - Check metadata completeness

  4. Citation Filter - Time-aware citation thresholds

  5. Relevance Rank - Score and limit to top N papers

Phase 1: ItemType Filtering

Keep only the publication types listed in allowed_item_types; papers of any other type are dropped.

quality_filters:
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper
    - bookSection
    - book

Common types:

  • journalArticle - Peer-reviewed journals

  • conferencePaper - Conference proceedings

  • book - Academic books

  • bookSection - Book chapters

  • preprint - Pre-publication manuscripts

  • thesis - Dissertations

  • report - Technical reports
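The effect of this phase can be sketched in a few lines of Python. This is only an illustration, not SciLEx's actual implementation: the record layout (a dict with an "itemType" key) and the function name filter_by_item_type are assumptions.

```python
# Hypothetical record layout: each paper is a dict with an "itemType" key.
ALLOWED_ITEM_TYPES = {"journalArticle", "conferencePaper", "bookSection", "book"}


def filter_by_item_type(papers, allowed=ALLOWED_ITEM_TYPES):
    """Keep only papers whose itemType is in the allowed set."""
    return [p for p in papers if p.get("itemType") in allowed]


papers = [
    {"title": "A", "itemType": "journalArticle"},
    {"title": "B", "itemType": "preprint"},
    {"title": "C", "itemType": "bookSection"},
    {"title": "D"},  # no itemType at all: dropped in this sketch
]

kept = filter_by_item_type(papers)
print([p["title"] for p in kept])  # ['A', 'C']
```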

Phase 2: Keyword Matching

Single Group (OR Logic)

Papers match if they contain ANY keyword in the first group:

keywords:
  - ["neural network", "deep learning", "CNN"]
  - []  # Empty second group: no AND constraint

Dual Group (AND Logic)

Papers must match at least one from EACH group:

keywords:
  - ["climate", "weather"]         # Topic
  - ["prediction", "forecast"]     # Method
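The two modes can be expressed as one rule: OR within a group, AND across the non-empty groups. The sketch below assumes case-insensitive substring matching on a single text field; SciLEx's real matching rules (fields searched, tokenization) may differ, and the function name matches_keywords is hypothetical.

```python
def matches_keywords(text, keyword_groups):
    """Return True if text contains at least one keyword from every non-empty group."""
    haystack = text.lower()
    # Empty groups impose no constraint, so they are skipped.
    active_groups = [group for group in keyword_groups if group]
    return all(
        any(keyword.lower() in haystack for keyword in group)
        for group in active_groups
    )


single = [["neural network", "deep learning", "CNN"], []]
dual = [["climate", "weather"], ["prediction", "forecast"]]

print(matches_keywords("Deep learning for images", single))           # True
print(matches_keywords("Weather forecast with climate models", dual))  # True
print(matches_keywords("Climate policy review", dual))                # False
```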

Phase 3: Quality Scoring

Scores metadata completeness (0-100):

  • Critical fields (5 pts each): DOI, title, authors, year

  • Important fields (3 pts each): abstract, journal, volume, issue

  • Nice-to-have (1 pt each): pages, URL, keywords

quality_filters:
  validate_abstracts: true
  min_abstract_quality_score: 60
  filter_by_abstract_quality: true
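As a rough sketch of the point scheme listed above, the following adds up the points for each populated field. The listed values total 35 points, so rescaling to 0-100 is an assumption made here, as are the dict keys and the function name quality_score; the guide does not say how SciLEx maps points onto the final scale.

```python
# Point values taken from the list above; key names are hypothetical.
FIELD_POINTS = {
    "DOI": 5, "title": 5, "authors": 5, "year": 5,          # critical
    "abstract": 3, "journal": 3, "volume": 3, "issue": 3,   # important
    "pages": 1, "url": 1, "keywords": 1,                    # nice-to-have
}
MAX_POINTS = sum(FIELD_POINTS.values())  # 35


def quality_score(paper):
    """Sum points for populated fields, rescaled to 0-100 (assumed normalization)."""
    earned = sum(points for field, points in FIELD_POINTS.items() if paper.get(field))
    return round(100 * earned / MAX_POINTS, 1)


paper = {
    "DOI": "10.0000/example",  # placeholder value
    "title": "An example paper",
    "authors": ["A. Author"],
    "year": 2023,
    "abstract": "Some abstract text.",
}
print(quality_score(paper))  # 65.7  (23 of 35 points)
```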

Phase 4: Citation Filtering

Time-aware thresholds based on paper age:

  • 0-3 months: 0 citations required

  • 3-6 months: 1+ required

  • 6-12 months: 3+ required

  • 12-24 months: 5-8+ required

  • 24+ months: 10+ required

aggregate_get_citations: true

quality_filters:
  apply_citation_filter: true
  min_citations_per_year: 2  # Minimum average citations per year
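The age brackets above can be read as a simple lookup. This sketch assumes half-open intervals (a paper exactly 3 months old falls in the 3-6 bracket) and takes the lower bound, 5, for the "5-8+" bracket; both are assumptions, and the function names are hypothetical. How min_citations_per_year interacts with these brackets is not modeled here.

```python
def required_citations(age_months):
    """Minimum citation count for a paper of the given age, per the brackets above."""
    if age_months < 3:
        return 0
    if age_months < 6:
        return 1
    if age_months < 12:
        return 3
    if age_months < 24:
        return 5  # guide says "5-8+"; the lower bound is assumed here
    return 10


def passes_citation_filter(citations, age_months):
    """True if the paper meets the threshold for its age bracket."""
    return citations >= required_citations(age_months)


print(passes_citation_filter(citations=4, age_months=8))   # True  (needs 3)
print(passes_citation_filter(citations=4, age_months=30))  # False (needs 10)
```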

Phase 5: Relevance Ranking

Composite score combining:

  • Keyword frequency (45%)

  • Metadata quality (25%)

  • Publication type (20%)

  • Citation impact (10%)

quality_filters:
  apply_relevance_ranking: true
  max_papers: 500  # Keep top 500

  relevance_weights:
    keywords: 0.45
    quality: 0.25
    itemtype: 0.20
    citations: 0.10
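A weighted sum with these weights might look like the sketch below. It assumes each component has already been normalized to the range 0-1, which the guide does not state; the names relevance_score and top_n and the "components" key are hypothetical.

```python
WEIGHTS = {"keywords": 0.45, "quality": 0.25, "itemtype": 0.20, "citations": 0.10}


def relevance_score(components, weights=WEIGHTS):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * components[name] for name in weights)


def top_n(papers, n):
    """Sort papers by descending relevance score and keep the first n."""
    ranked = sorted(papers, key=lambda p: relevance_score(p["components"]), reverse=True)
    return ranked[:n]


papers = [
    {"title": "A", "components": {"keywords": 0.8, "quality": 0.6, "itemtype": 1.0, "citations": 0.5}},
    {"title": "B", "components": {"keywords": 0.2, "quality": 0.9, "itemtype": 0.5, "citations": 1.0}},
    {"title": "C", "components": {"keywords": 1.0, "quality": 1.0, "itemtype": 1.0, "citations": 0.0}},
]

print(round(relevance_score(papers[0]["components"]), 2))  # 0.76
print([p["title"] for p in top_n(papers, 2)])              # ['C', 'A']
```

Because keywords carry 45% of the weight, paper C outranks A despite having no citations.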

Complete Configuration

keywords:
  - ["explainable AI", "XAI"]
  - ["healthcare", "medical"]

years: [2022, 2023, 2024]

apis:
  - SemanticScholar
  - OpenAlex

aggregate_get_citations: true

quality_filters:
  # Phase 1
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper

  # Phase 3
  validate_abstracts: true
  min_abstract_quality_score: 60
  filter_by_abstract_quality: true

  # Phase 4
  apply_citation_filter: true
  min_citations_per_year: 2

  # Phase 5
  apply_relevance_ranking: true
  max_papers: 300

  relevance_weights:
    keywords: 0.45
    quality: 0.25
    itemtype: 0.20
    citations: 0.10

Monitoring

Check the aggregation report to see how many papers remain after each phase:

Initial papers: 10,000
After ItemType: 7,000
After Keywords: 4,200
After Quality: 3,360
After Citations: 2,352
After Relevance: 300
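To see which phase removes the most papers, the counts above can be turned into per-phase retention rates. This small sketch hard-codes the numbers from the report shown here; it does not parse a real report file.

```python
# Counts copied from the report above.
stages = [
    ("Initial papers", 10000),
    ("After ItemType", 7000),
    ("After Keywords", 4200),
    ("After Quality", 3360),
    ("After Citations", 2352),
    ("After Relevance", 300),
]


def retention(stages):
    """Percentage of papers kept by each phase, relative to the previous stage."""
    return [
        (name, round(100 * count / previous, 1))
        for (_, previous), (name, count) in zip(stages, stages[1:])
    ]


for name, percent in retention(stages):
    print(f"{name}: {percent}% kept")
# After ItemType: 70.0% kept
# After Keywords: 60.0% kept
# After Quality: 80.0% kept
# After Citations: 70.0% kept
# After Relevance: 12.8% kept
```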

Troubleshooting

Too Few Papers?

  1. Relax keyword restrictions (use single group mode)

  2. Lower quality thresholds

  3. Disable citation filter

quality_filters:
  apply_citation_filter: false
  min_abstract_quality_score: 40

Too Many Papers?

  1. Use dual keyword groups (AND logic)

  2. Enable all filters

  3. Set a lower max_papers limit

Check Results

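import pandas as pd

# The aggregated file is semicolon-delimited
df = pd.read_csv('aggregated_data.csv', delimiter=';')

# Show the 10 highest-ranked papers with their scores and citation counts
top = df.nlargest(10, 'relevance_score')
print(top[['title', 'relevance_score', 'quality_score', 'nb_citation']])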