# Advanced Filtering Guide

SciLEx applies a 5-phase filtering pipeline to refine paper collections.

## Filtering Pipeline

1. **ItemType Filter** - Keep specific publication types
2. **Keyword Match** - Verify search term relevance
3. **Quality Score** - Check metadata completeness
4. **Citation Filter** - Time-aware citation thresholds
5. **Relevance Rank** - Score and limit to top N papers

## Phase 1: ItemType Filtering

Keep only specific publication types.

```yaml
quality_filters:
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper
    - bookSection
    - book
```

Common types:
- `journalArticle` - Peer-reviewed journals
- `conferencePaper` - Conference proceedings
- `book` - Academic books
- `bookSection` - Book chapters
- `preprint` - Pre-publication
- `thesis` - Dissertations
- `report` - Technical reports

## Phase 2: Keyword Matching

### Single Group (OR Logic)

Papers match ANY keyword:

```yaml
keywords:
  - ["neural network", "deep learning", "CNN"]
  - []  # Empty
```

### Dual Group (AND Logic)

Papers must match at least one from EACH group:

```yaml
keywords:
  - ["climate", "weather"]         # Topic
  - ["prediction", "forecast"]     # Method
```

## Phase 3: Quality Scoring

Scores metadata completeness (0-100):

- Critical fields (5 pts each): DOI, title, authors, year
- Important fields (3 pts each): abstract, journal, volume, issue
- Nice-to-have (1 pt each): pages, URL, keywords

```yaml
quality_filters:
  validate_abstracts: true
  min_abstract_quality_score: 60
  filter_by_abstract_quality: true
```

## Phase 4: Citation Filtering

Time-aware thresholds based on paper age:

- 0-3 months: 0 citations required
- 3-6 months: 1+ required
- 6-12 months: 3+ required
- 12-24 months: 5-8+ required
- 24+ months: 10+ required

```yaml
aggregate_get_citations: true

quality_filters:
  apply_citation_filter: true
  min_citations_per_year: 2  # Average per year
```

## Phase 5: Relevance Ranking

Composite score combining:
- Keyword frequency (45%)
- Metadata quality (25%)
- Publication type (20%)
- Citation impact (10%)

```yaml
quality_filters:
  apply_relevance_ranking: true
  max_papers: 500  # Keep top 500

  relevance_weights:
    keywords: 0.45
    quality: 0.25
    itemtype: 0.20
    citations: 0.10
```

## Complete Configuration

```yaml
keywords:
  - ["explainable AI", "XAI"]
  - ["healthcare", "medical"]

years: [2022, 2023, 2024]

apis:
  - SemanticScholar
  - OpenAlex

aggregate_get_citations: true

quality_filters:
  # Phase 1
  enable_itemtype_filter: true
  allowed_item_types:
    - journalArticle
    - conferencePaper

  # Phase 3
  validate_abstracts: true
  min_abstract_quality_score: 60
  filter_by_abstract_quality: true

  # Phase 4
  apply_citation_filter: true
  min_citations_per_year: 2

  # Phase 5
  apply_relevance_ranking: true
  max_papers: 300

  relevance_weights:
    keywords: 0.45
    quality: 0.25
    itemtype: 0.20
    citations: 0.10
```

## Monitoring

Check the aggregation report:

```
Initial papers: 10,000
After ItemType: 7,000
After Keywords: 4,200
After Quality: 3,360
After Citations: 2,352
After Relevance: 300
```

## Troubleshooting

### Too Few Papers?

1. Relax keyword restrictions (use single group mode)
2. Lower quality thresholds
3. Disable citation filter

```yaml
quality_filters:
  apply_citation_filter: false
  min_abstract_quality_score: 40
```

### Too Many Papers?

1. Use dual keyword groups (AND logic)
2. Enable all filters
3. Set lower `max_papers` limit

### Check Results

```python
import pandas as pd

df = pd.read_csv('aggregated_data.csv', delimiter=';')

# Check scores
top = df.nlargest(10, 'relevance_score')
print(top[['title', 'relevance_score', 'quality_score', 'nb_citation']])
```