Advanced Filtering Guide

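SciLEx applies a five-phase filtering pipeline to refine paper collections. The phases run in the order listed below, and each one narrows the set of papers passed to the next.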

Filtering Pipeline

ItemType Filter - Keep specific publication types
Keyword Match - Verify search term relevance
Quality Score - Check metadata completeness
Citation Filter - Time-aware citation thresholds
Relevance Rank - Score and limit to top N papers

Phase 1: ItemType Filtering

Keep only specific publication types.

quality_filters:
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper
    - bookSection
    - book

Common types:

journalArticle - Peer-reviewed journals
conferencePaper - Conference proceedings
book - Academic books
bookSection - Book chapters
preprint - Pre-publication
thesis - Dissertations
report - Technical reports
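
As an illustration of what this phase does, here is a minimal sketch in plain Python. It is not SciLEx's implementation: the function name `filter_by_item_type` and the `itemType` record key are assumptions made for the example.

```python
# Hypothetical sketch of Phase 1; names are illustrative, not SciLEx's API.
ALLOWED_ITEM_TYPES = {"journalArticle", "conferencePaper", "bookSection", "book"}


def filter_by_item_type(papers, allowed=ALLOWED_ITEM_TYPES):
    """Keep only papers whose 'itemType' (assumed key) is in the allowed set."""
    return [p for p in papers if p.get("itemType") in allowed]


papers = [
    {"title": "A", "itemType": "journalArticle"},
    {"title": "B", "itemType": "preprint"},
    {"title": "C", "itemType": "book"},
    {"title": "D"},  # missing type: dropped, since None is not in the set
]
kept = filter_by_item_type(papers)
print([p["title"] for p in kept])
```

Records with a missing type are dropped in this sketch; whether SciLEx does the same is not stated here.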

Phase 2: Keyword Matching

Single Group (OR Logic)

With one group filled in and the second left empty, a paper matches if it contains ANY keyword from the group:

keywords:
  - ["neural network", "deep learning", "CNN"]
  - []  # Empty second group: no AND constraint

Dual Group (AND Logic)

Papers must match at least one from EACH group:

keywords:
  - ["climate", "weather"]         # Topic
  - ["prediction", "forecast"]     # Method
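
The OR-within-group, AND-across-groups logic can be sketched as follows. This is a simplified illustration, assuming case-insensitive substring matching against a single text field; the function name `matches_keywords` is hypothetical and SciLEx's actual matching rules may differ.

```python
def matches_keywords(text, keyword_groups):
    """Return True if text contains at least one keyword from every
    non-empty group (OR inside a group, AND between groups).

    Assumption: case-insensitive substring matching.
    """
    haystack = text.lower()
    active_groups = [group for group in keyword_groups if group]
    return all(
        any(keyword.lower() in haystack for keyword in group)
        for group in active_groups
    )


single = [["neural network", "deep learning", "CNN"], []]
dual = [["climate", "weather"], ["prediction", "forecast"]]

print(matches_keywords("Deep learning for image analysis", single))
print(matches_keywords("Climate forecast with LSTMs", dual))
print(matches_keywords("A review of climate policy", dual))
```

In this sketch the empty group is skipped, which is what makes the single-group configuration behave as pure OR.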

Phase 3: Quality Scoring

Scores metadata completeness (0-100):

Critical fields (5 pts each): DOI, title, authors, year
Important fields (3 pts each): abstract, journal, volume, issue
Nice-to-have (1 pt each): pages, URL, keywords

quality_filters:
  validate_abstracts: true
  min_abstract_quality_score: 60
  filter_by_abstract_quality: true
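
The point scheme above can be sketched as a small scoring function. Note that the listed points add up to 35, so this sketch rescales the raw total to 0-100; that rescaling is an assumption, as are the record keys and the function names `raw_points` and `quality_score`.

```python
# Hypothetical sketch of metadata completeness scoring; keys are assumed.
FIELD_POINTS = {
    # Critical fields (5 pts each)
    "DOI": 5, "title": 5, "authors": 5, "year": 5,
    # Important fields (3 pts each)
    "abstract": 3, "journal": 3, "volume": 3, "issue": 3,
    # Nice-to-have fields (1 pt each)
    "pages": 1, "url": 1, "keywords": 1,
}
MAX_POINTS = sum(FIELD_POINTS.values())  # 35 with the weights above


def is_present(value):
    """Treat None, blank strings, and empty containers as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) > 0
    return True


def raw_points(paper):
    """Sum the points of every field that is present."""
    return sum(pts for field, pts in FIELD_POINTS.items()
               if is_present(paper.get(field)))


def quality_score(paper):
    """Rescale raw points to 0-100 (assumed normalization)."""
    return round(100 * raw_points(paper) / MAX_POINTS, 1)


paper = {"DOI": "10.1000/example", "title": "T", "authors": ["A. Author"],
         "year": 2023, "abstract": "Some text", "journal": ""}
print(raw_points(paper), quality_score(paper))
```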

Phase 4: Citation Filtering

Time-aware thresholds based on paper age:

0-3 months: 0 citations required
3-6 months: 1+ required
6-12 months: 3+ required
12-24 months: 5-8+ required
24+ months: 10+ required

aggregate_get_citations: true

quality_filters:
  apply_citation_filter: true
  min_citations_per_year: 2  # Average per year
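
The age bands above can be read as a lookup from paper age to a minimum citation count. The sketch below makes two assumptions that the list leaves open: each band is half-open (a paper exactly 3 months old falls in the 3-6 band), and the 12-24 month band uses the lower bound of 5. All names are hypothetical.

```python
from datetime import date

# (exclusive upper bound of age in months, minimum citations required)
# Assumption: the "5-8+" band is represented by its lower bound, 5.
THRESHOLDS = [(3, 0), (6, 1), (12, 3), (24, 5)]
MIN_FOR_OLDER_PAPERS = 10  # 24+ months


def age_in_months(published, today):
    """Whole calendar months between publication and today."""
    return (today.year - published.year) * 12 + (today.month - published.month)


def required_citations(age_months):
    """Look up the minimum citation count for a paper of the given age."""
    for upper_bound, minimum in THRESHOLDS:
        if age_months < upper_bound:
            return minimum
    return MIN_FOR_OLDER_PAPERS


def passes_citation_filter(citations, published, today):
    return citations >= required_citations(age_in_months(published, today))


today = date(2024, 9, 1)
print(passes_citation_filter(2, date(2024, 1, 15), today))  # 8 months old
print(passes_citation_filter(0, date(2024, 8, 1), today))   # 1 month old
```

Passing `today` explicitly keeps the function deterministic, which makes threshold changes easier to test.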

Phase 5: Relevance Ranking

Composite score combining:

Keyword frequency (45%)
Metadata quality (25%)
Publication type (20%)
Citation impact (10%)

quality_filters:
  apply_relevance_ranking: true
  max_papers: 500  # Keep top 500

  relevance_weights:
    keywords: 0.45
    quality: 0.25
    itemtype: 0.20
    citations: 0.10
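
The composite score is a weighted sum of the four components. A minimal sketch, assuming each component has already been normalized to the range 0-1 (how SciLEx normalizes them is not described here); `relevance_score` and `rank_papers` are hypothetical names.

```python
# Weights copied from the relevance_weights configuration above.
WEIGHTS = {"keywords": 0.45, "quality": 0.25, "itemtype": 0.20, "citations": 0.10}


def relevance_score(components, weights=WEIGHTS):
    """Weighted sum of component scores (each assumed to lie in [0, 1])."""
    return sum(weights[name] * components[name] for name in weights)


def rank_papers(papers, max_papers, weights=WEIGHTS):
    """Sort by relevance, highest first, and keep the top max_papers."""
    ranked = sorted(
        papers,
        key=lambda p: relevance_score(p["components"], weights),
        reverse=True,
    )
    return ranked[:max_papers]


papers = [
    {"title": "c", "components": {"keywords": 0.0, "quality": 0.0, "itemtype": 0.0, "citations": 0.0}},
    {"title": "a", "components": {"keywords": 1.0, "quality": 1.0, "itemtype": 1.0, "citations": 1.0}},
    {"title": "b", "components": {"keywords": 0.2, "quality": 1.0, "itemtype": 1.0, "citations": 1.0}},
]
print([p["title"] for p in rank_papers(papers, max_papers=2)])
```

Because the weights sum to 1, a paper that scores 1.0 on every component gets a composite score of 1.0.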

Complete Configuration

keywords:
  - ["explainable AI", "XAI"]
  - ["healthcare", "medical"]

years: [2022, 2023, 2024]

apis:
  - SemanticScholar
  - OpenAlex

aggregate_get_citations: true

quality_filters:
  # Phase 1
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper

  # Phase 3
  validate_abstracts: true
  min_abstract_quality_score: 60
  filter_by_abstract_quality: true

  # Phase 4
  apply_citation_filter: true
  min_citations_per_year: 2

  # Phase 5
  apply_relevance_ranking: true
  max_papers: 300

  relevance_weights:
    keywords: 0.45
    quality: 0.25
    itemtype: 0.20
    citations: 0.10

Monitoring

Check the aggregation report to see how many papers survive each phase. For example:

Initial papers: 10,000
After ItemType: 7,000
After Keywords: 4,200
After Quality: 3,360
After Citations: 2,352
After Relevance: 300
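
To see which phase removes the most papers, it helps to convert the counts into per-phase retention rates. The sketch below uses the example figures above; the `retention_rates` helper is hypothetical and not part of SciLEx.

```python
# Counts taken from the example report above.
report = [
    ("Initial papers", 10_000),
    ("After ItemType", 7_000),
    ("After Keywords", 4_200),
    ("After Quality", 3_360),
    ("After Citations", 2_352),
    ("After Relevance", 300),
]


def retention_rates(counts):
    """Return (label, kept, percent kept relative to the previous phase)."""
    rows = []
    for (_, previous), (label, current) in zip(counts, counts[1:]):
        rows.append((label, current, round(100 * current / previous, 1)))
    return rows


for label, kept, percent in retention_rates(report):
    print(f"{label}: {kept} papers ({percent}% of previous phase)")
```

With these figures the first four phases each keep 60-80% of their input, while relevance ranking keeps about 12.8% because `max_papers` caps the result at 300.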

Troubleshooting

Too Few Papers?

Relax keyword restrictions (use single group mode)
Lower quality thresholds
Disable citation filter

quality_filters:
  apply_citation_filter: false
  min_abstract_quality_score: 40

Too Many Papers?

Use dual keyword groups (AND logic)
Enable all filters
Set a lower max_papers limit

Check Results

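import pandas as pd

# The aggregated output is semicolon-delimited
df = pd.read_csv('aggregated_data.csv', delimiter=';')

# Show the 10 highest-ranked papers with their scores and citation counts
top = df.nlargest(10, 'relevance_score')
print(top[['title', 'relevance_score', 'quality_score', 'nb_citation']])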