Methodology
- Scholarly volunteers manually identified scientific names on a random sample of 392 pages in BHL (0.01% of the BHL corpus at the time of the study).
- Those names were compared against the OCR text, then against the output of two name-finding algorithms (TaxonFinder and FAT).
- Number of Pages: 392
- Average Number of Words per Page: 446.8
- Average Number of Names per Page: 7.7
- Total Number of Names: 3003
- Total Number of Unique Names: 2610
- Of the 3,003 names, 1,056 were incorrectly transcribed by OCR, for an error rate of 35.16%
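
The error rate above is the share of manually identified names that did not survive OCR intact. A minimal sketch of that bookkeeping follows, assuming a name counts as an error when it does not appear verbatim in the page's OCR text. The study's exact matching rule is not stated here, and `ocr_error_rate` and the toy pages are hypothetical.

```python
def ocr_error_rate(pages):
    """pages: iterable of (ocr_text, names) pairs.

    Returns (errors, total, rate), where a name is an error if it is
    not found verbatim in the OCR text (an assumed matching rule).
    """
    errors = total = 0
    for ocr_text, names in pages:
        for name in names:
            total += 1
            if name not in ocr_text:
                errors += 1
    rate = errors / total if total else 0.0
    return errors, total, rate


# Toy data: the second name on the first page was garbled by OCR (m -> rn).
sample = [
    ("Specimens of Quercus alba and Acer rubrurn were collected.",
     ["Quercus alba", "Acer rubrum"]),
    ("Pinus strobus is common here.", ["Pinus strobus"]),
]
errors, total, rate = ocr_error_rate(sample)
print(errors, total, f"{rate:.2%}")

# The study's own totals reproduce the reported figure.
reported = f"{1056 / 3003:.2%}"
print(reported)
```

With the study's totals, 1,056 / 3,003 gives the reported 35.16%.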
- Top OCR errors
  1. Insert Space
  2. Omit Space
  3. e->c
  4. u->I
  5. u->n
  6. i->l
  7. c->e
  8. n->v
  9. l->i
  10. r->i
  11. u->ii
  12. h->l
  13. h->ii
  14. e->o
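
A ranking like the one above can be produced by aligning each true name with its OCR rendering and tallying the differences. The sketch below uses Python's standard `difflib` for the alignment. It assumes `x->y` means "true text `x` was read as `y`". The function name and the sample pairs are hypothetical, and this is not necessarily how the study derived its list.

```python
from collections import Counter
from difflib import SequenceMatcher


def classify_errors(truth, ocr):
    """Label each difference between a true name and its OCR rendering."""
    labels = []
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            continue
        before, after = truth[a0:a1], ocr[b0:b1]
        if op == "insert" and after == " ":
            labels.append("Insert Space")
        elif op == "delete" and before == " ":
            labels.append("Omit Space")
        else:
            labels.append(f"{before}->{after}")
    return labels


# Hypothetical (true name, OCR rendering) pairs.
pairs = [
    ("Quercus alba", "Qucrcus alba"),
    ("Quercus alba", "Quer cus alba"),
    ("Acer rubrum", "Acerrubrum"),
]
tally = Counter(label for truth, ocr in pairs
                for label in classify_errors(truth, ocr))
print(tally.most_common())
```

Multi-character confusions such as `u->ii` fall out of the same alignment, since a replaced span may differ in length from the text it replaces.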
- TaxonFinder
  - Excluding names with OCR errors
    - Precision 40.32%
    - Recall 36.62%
    - F-score 38.47%
  - Including names with OCR errors
    - Precision 43.77%
    - Recall 25.82%
    - F-score 34.80%
- FAT
  - Excluding names with OCR errors
    - Precision 28.20%
    - Recall 23.34%
    - F-score 25.77%
  - Including names with OCR errors
    - Precision 32.25%
    - Recall 17.21%
    - F-score 24.73%
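
For readers who want to reproduce these measures, here is a minimal sketch of how precision, recall, and F-score are usually computed from a set of found names and a set of expected names. The function names and toy sets are hypothetical. Note that the F-score values reported above coincide with the arithmetic mean of precision and recall (for example, (40.32 + 36.62) / 2 = 38.47), whereas the conventional F1, the harmonic mean, comes out slightly lower.

```python
def precision_recall(found, expected):
    """Precision and recall of a name finder against a manual gold standard."""
    found, expected = set(found), set(expected)
    true_positives = len(found & expected)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def f1(precision, recall):
    """Conventional F1: harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def arithmetic_mean(precision, recall):
    return (precision + recall) / 2


# Toy example: 2 of 3 found names are correct; 2 of 4 expected names are found.
p, r = precision_recall(
    {"Quercus alba", "Acer rubrum", "Page ii"},
    {"Quercus alba", "Acer rubrum", "Pinus strobus", "Betula nigra"},
)
print(f"precision={p:.2f} recall={r:.2f} F1={f1(p, r):.2f}")

# TaxonFinder, excluding names with OCR errors (figures from the list above).
print(round(arithmetic_mean(40.32, 36.62), 2))  # matches the reported 38.47
print(round(f1(40.32, 36.62), 2))               # harmonic mean, about 38.38
```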
- Improving OCR software is out of current scope for BHL
- Investigations into Tesseract may be worthwhile
- Rekeying is too expensive and will not scale
- Enhance “fuzzy” retrieval in algorithms
- Exception rules to overcome OCR errors
- More work needed in this space
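
As one illustration of what "fuzzy" retrieval could look like, the sketch below matches an OCR string against a list of known names by string similarity, using `difflib.get_close_matches` from the Python standard library. This is not how TaxonFinder or FAT work internally. The name list, the function name, and the 0.85 cutoff are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical reference list; a real system would use a large name index.
KNOWN_NAMES = ["Quercus alba", "Quercus rubra", "Acer rubrum"]


def fuzzy_lookup(ocr_string, known_names=KNOWN_NAMES, cutoff=0.85):
    """Return the closest known name, or an empty list if none is similar enough."""
    return get_close_matches(ocr_string, known_names, n=1, cutoff=cutoff)


print(fuzzy_lookup("Qucrcus alba"))   # e read as c
print(fuzzy_lookup("Acer rubrurn"))   # m read as rn
print(fuzzy_lookup("Page ii"))        # not a name at all
```

A high cutoff keeps false matches down at the cost of missing heavily damaged names. Exception rules built from the most frequent OCR confusions could be layered on top to recover some of those.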
- Project wiki, including downloadable datasets
- TDWG Presentation
- TDWG Poster (page size)
