Advanced Filtering Guide
SciLEx applies a 5-phase filtering pipeline to refine paper collections.
Filtering Pipeline
1. ItemType Filter - Keep specific publication types
2. Keyword Match - Verify search term relevance
3. Quality Score - Check metadata completeness
4. Citation Filter - Time-aware citation thresholds
5. Relevance Rank - Score and limit to top N papers
Phase 1: ItemType Filtering
Keep only specific publication types.
quality_filters:
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper
    - bookSection
    - book
Common types:
journalArticle - Peer-reviewed journals
conferencePaper - Conference proceedings
book - Academic books
bookSection - Book chapters
preprint - Pre-publication
thesis - Dissertations
report - Technical reports
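The effect of this phase can be pictured with a few lines of Python. This is only a minimal sketch, not SciLEx's implementation: the record layout (a dict with an `itemType` key) and the function name `filter_by_item_type` are assumptions made for illustration.

```python
# Hypothetical record layout: each paper is a dict with an "itemType" key.
ALLOWED_ITEM_TYPES = {"journalArticle", "conferencePaper", "bookSection", "book"}

def filter_by_item_type(papers, allowed=ALLOWED_ITEM_TYPES):
    """Keep only papers whose itemType is in the allowed set."""
    return [p for p in papers if p.get("itemType") in allowed]

papers = [
    {"title": "A", "itemType": "journalArticle"},
    {"title": "B", "itemType": "preprint"},
    {"title": "C", "itemType": "conferencePaper"},
    {"title": "D"},  # no itemType at all: dropped in this sketch
]
kept = filter_by_item_type(papers)
print([p["title"] for p in kept])
```

In this sketch a paper with no type information is dropped; whether the real filter treats missing types the same way is not stated here.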
Phase 2: Keyword Matching
Single Group (OR Logic)
A paper matches if it contains ANY keyword from the first group; the second group is left empty:
keywords:
  - ["neural network", "deep learning", "CNN"]
  - []  # Empty second group
Dual Group (AND Logic)
Papers must match at least one from EACH group:
keywords:
  - ["climate", "weather"]      # Topic
  - ["prediction", "forecast"]  # Method
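The OR-within-group, AND-across-groups rule can be sketched as follows. The function name `matches_keywords` and the case-insensitive substring test are assumptions of this example; the actual matching in SciLEx may tokenize or normalize text differently.

```python
def matches_keywords(text, keyword_groups):
    """Return True if text matches at least one keyword from every non-empty group."""
    lowered = text.lower()
    # Empty groups are skipped, so a single-group config behaves as plain OR.
    active_groups = [group for group in keyword_groups if group]
    return all(
        any(keyword.lower() in lowered for keyword in group)
        for group in active_groups
    )

single = [["neural network", "deep learning", "CNN"], []]
dual = [["climate", "weather"], ["prediction", "forecast"]]

print(matches_keywords("A CNN for image tagging", single))           # True
print(matches_keywords("Weather forecast with transformers", dual))  # True
print(matches_keywords("Climate policy review", dual))               # False
```

Note that in this sketch a configuration with no non-empty groups lets every paper through, because there is nothing to require.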
Phase 3: Quality Scoring
Scores metadata completeness (0-100):
Critical fields (5 pts each): DOI, title, authors, year
Important fields (3 pts each): abstract, journal, volume, issue
Nice-to-have (1 pt each): pages, URL, keywords
quality_filters:
  validate_abstracts: true
  min_abstract_quality_score: 60
  filter_by_abstract_quality: true
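To make the point scheme concrete, here is a rough sketch of a completeness score. The field keys, the function name `quality_score`, and the DOI value are placeholders. The listed points add up to 35, so this sketch rescales the raw total to the 0-100 range; that rescaling is an assumption, since the guide does not say how the raw points map onto the scale.

```python
# Points per field, taken from the list above; the dict keys are hypothetical.
FIELD_POINTS = {
    "DOI": 5, "title": 5, "authors": 5, "year": 5,         # critical
    "abstract": 3, "journal": 3, "volume": 3, "issue": 3,  # important
    "pages": 1, "url": 1, "keywords": 1,                   # nice-to-have
}
MAX_POINTS = sum(FIELD_POINTS.values())  # 35

def quality_score(paper):
    """Sum points for every non-empty field, rescaled to 0-100 (assumed)."""
    raw = sum(points for field, points in FIELD_POINTS.items() if paper.get(field))
    return round(100 * raw / MAX_POINTS, 1)

paper = {
    "DOI": "10.1000/example",  # placeholder value
    "title": "T",
    "authors": ["A"],
    "year": 2023,
    "abstract": "Some abstract text",
}
print(quality_score(paper))  # 23 of 35 raw points
```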
Phase 4: Citation Filtering
Time-aware thresholds based on paper age:
0-3 months: 0 citations required
3-6 months: 1+ required
6-12 months: 3+ required
12-24 months: 5-8+ required
24+ months: 10+ required
aggregate_get_citations: true
quality_filters:
  apply_citation_filter: true
  min_citations_per_year: 2  # Average per year
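The age-based thresholds can be read as a simple lookup. The sketch below is illustrative only: it treats each age band as a half-open interval, uses the lower end (5) of the "5-8+" band, and does not model `min_citations_per_year`. All three choices are assumptions, as are the function names.

```python
from datetime import date

# (exclusive upper bound on age in months, minimum citations required).
# Half-open bands are an assumption of this sketch.
THRESHOLDS = [(3, 0), (6, 1), (12, 3), (24, 5)]  # 5 = lower end of "5-8+"
DEFAULT_MIN = 10  # papers 24+ months old

def age_in_months(published, today):
    """Whole calendar months between publication and today."""
    return (today.year - published.year) * 12 + (today.month - published.month)

def min_citations(age_months):
    """Look up the citation threshold for a paper of the given age."""
    for upper, required in THRESHOLDS:
        if age_months < upper:
            return required
    return DEFAULT_MIN

def passes_citation_filter(citations, published, today):
    return citations >= min_citations(age_in_months(published, today))

today = date(2024, 6, 1)
print(passes_citation_filter(0, date(2024, 4, 1), today))   # 2 months old, needs 0
print(passes_citation_filter(2, date(2023, 10, 1), today))  # 8 months old, needs 3
print(passes_citation_filter(12, date(2021, 1, 1), today))  # 41 months old, needs 10
```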
Phase 5: Relevance Ranking
Composite score combining:
Keyword frequency (45%)
Metadata quality (25%)
Publication type (20%)
Citation impact (10%)
quality_filters:
  apply_relevance_ranking: true
  max_papers: 500  # Keep top 500
  relevance_weights:
    keywords: 0.45
    quality: 0.25
    itemtype: 0.20
    citations: 0.10
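A composite score of this kind is a weighted sum followed by a top-N cut. The sketch below assumes each component has already been normalized to the range 0 to 1; how SciLEx derives the individual components is not covered here, and the names `relevance_score` and `top_n` are hypothetical.

```python
WEIGHTS = {"keywords": 0.45, "quality": 0.25, "itemtype": 0.20, "citations": 0.10}

def relevance_score(components, weights=WEIGHTS):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * components[name] for name in weights)

def top_n(papers, n):
    """Sort by relevance, highest first, and keep the first n papers."""
    ranked = sorted(papers, key=lambda p: relevance_score(p["components"]), reverse=True)
    return ranked[:n]

papers = [
    {"title": "A", "components": {"keywords": 1.0, "quality": 0.5, "itemtype": 1.0, "citations": 0.0}},
    {"title": "B", "components": {"keywords": 0.2, "quality": 1.0, "itemtype": 0.5, "citations": 1.0}},
    {"title": "C", "components": {"keywords": 0.6, "quality": 0.6, "itemtype": 0.6, "citations": 0.6}},
]
best = top_n(papers, 2)
print([p["title"] for p in best])
```

With these weights, a paper that matches the keywords strongly (A) outranks one that is heavily cited but only weakly on topic (B).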
Complete Configuration
keywords:
  - ["explainable AI", "XAI"]
  - ["healthcare", "medical"]
years: [2022, 2023, 2024]
apis:
  - SemanticScholar
  - OpenAlex
aggregate_get_citations: true
quality_filters:
  # Phase 1
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper
  # Phase 3
  validate_abstracts: true
  min_abstract_quality_score: 60
  filter_by_abstract_quality: true
  # Phase 4
  apply_citation_filter: true
  min_citations_per_year: 2
  # Phase 5
  apply_relevance_ranking: true
  max_papers: 300
  relevance_weights:
    keywords: 0.45
    quality: 0.25
    itemtype: 0.20
    citations: 0.10
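Before running a collection it can help to sanity-check the `quality_filters` values. The following is a hypothetical helper, not part of SciLEx. It assumes the relevance weights are meant to sum to 1.0 (as they do in the examples in this guide) and that `min_abstract_quality_score` lives on the 0-100 scale.

```python
import math

def validate_quality_filters(cfg):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    weights = cfg.get("relevance_weights", {})
    total = sum(weights.values())
    # isclose absorbs floating-point rounding in sums such as 0.45 + 0.25 + 0.20 + 0.10
    if weights and not math.isclose(total, 1.0):
        problems.append(f"relevance_weights sum to {total:.2f}, expected 1.00")
    score = cfg.get("min_abstract_quality_score")
    if score is not None and not 0 <= score <= 100:
        problems.append("min_abstract_quality_score should be within 0-100")
    max_papers = cfg.get("max_papers")
    if max_papers is not None and max_papers <= 0:
        problems.append("max_papers should be positive")
    return problems

good = {
    "min_abstract_quality_score": 60,
    "max_papers": 300,
    "relevance_weights": {"keywords": 0.45, "quality": 0.25, "itemtype": 0.20, "citations": 0.10},
}
bad = {
    "min_abstract_quality_score": 140,
    "max_papers": 0,
    "relevance_weights": {"keywords": 0.6, "quality": 0.6},
}
print(validate_quality_filters(good))
print(validate_quality_filters(bad))
```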
Monitoring
Check the aggregation report:
Initial papers: 10,000
After ItemType: 7,000
After Keywords: 4,200
After Quality: 3,360
After Citations: 2,352
After Relevance: 300
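To see which phase removes the most papers, the counts can be turned into per-stage retention rates. This sketch uses the example figures above; the function name `retention_rates` is hypothetical.

```python
stages = [
    ("Initial papers", 10_000),
    ("After ItemType", 7_000),
    ("After Keywords", 4_200),
    ("After Quality", 3_360),
    ("After Citations", 2_352),
    ("After Relevance", 300),
]

def retention_rates(stages):
    """Return (stage name, fraction kept relative to the previous stage)."""
    return [
        (name, count / prev_count)
        for (_, prev_count), (name, count) in zip(stages, stages[1:])
    ]

for name, rate in retention_rates(stages):
    print(f"{name}: kept {rate:.1%}")
```

A stage that keeps only a small fraction of its input is the first place to look when the final collection is smaller than expected. In this example the last drop comes from the `max_papers: 300` cap rather than from a quality threshold.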
Troubleshooting
Too Few Papers?
Relax keyword restrictions (use single group mode)
Lower quality thresholds
Disable citation filter
quality_filters:
  apply_citation_filter: false
  min_abstract_quality_score: 40
Too Many Papers?
Use dual keyword groups (AND logic)
Enable all filters
Set a lower max_papers limit
Check Results
import pandas as pd
df = pd.read_csv('aggregated_data.csv', delimiter=';')
# Check scores
top = df.nlargest(10, 'relevance_score')
print(top[['title', 'relevance_score', 'quality_score', 'nb_citation']])