Databases
To promote further research into OCR for scientific documents,
the InftyProject releases databases that may be suitable for research
outside of InftyProject development.
We have carefully scrutinized the data in these releases,
and expect that the databases can be relied upon with high confidence.
Nevertheless, if you experience any problems with them, please
contact us.
1. InftyCDB-1
A Ground Truth Database
of Characters, Symbols, Words and Formulas in Mathematical Documents;
First Distribution, March 18, 2005
- Description:
InftyCDB-1 consists of 30 mathematics
articles in English. In all, it comprises 688,580 character samples, from
476 pages of text. The image of each alphanumeric character or mathematical
symbol is recorded, together with the character code of the symbol
it represents.
In addition, links are recorded that represent the structure of
each word or mathematical expression that appears.
Thus, InftyCDB-1 can be used as a word or mathematical expression database,
as well as a character image database.
There are 108,914 words and 21,056 mathematical expressions in InftyCDB-1.
For more details about the database, please see
here.
- Conditions of use:
Usage for research, development,
or testing of OCR systems (possibly commercial) for scientific documents
is permitted, free of charge.
- Availability:
Please complete the
user registration;
you will then receive a URL from which you can download InftyCDB-1.
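As a rough illustration of the structure described above, each character sample can be thought of as a record that carries its character code plus a link to a parent sample, so that a word or expression is recovered by following the links. The sketch below is only a model under stated assumptions: the field names (`sample_id`, `code`, `parent_id`, `relation`) and the relation labels (`next`, `sup`, `sub`) are hypothetical and are not the actual InftyCDB-1 columns, which are defined in the detailed documentation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Sample:
    """One character sample; all field names are hypothetical."""
    sample_id: int
    code: str                 # character code of the symbol
    parent_id: Optional[int]  # None marks the first symbol of an expression
    relation: str             # assumed labels: "next", "sup", "sub"


def build_index(samples: List[Sample]) -> Dict[int, List[Sample]]:
    """Group samples by the parent they link to."""
    index: Dict[int, List[Sample]] = defaultdict(list)
    for sample in samples:
        if sample.parent_id is not None:
            index[sample.parent_id].append(sample)
    return index


def render(sample: Sample, index: Dict[int, List[Sample]]) -> str:
    """Follow the links from one sample and emit a LaTeX-like string."""
    text = sample.code
    children = index.get(sample.sample_id, [])
    # Scripts attach to the symbol itself, so emit them first.
    for marker, relation in (("_", "sub"), ("^", "sup")):
        for child in children:
            if child.relation == relation:
                text += marker + "{" + render(child, index) + "}"
    # Then continue with the horizontally adjacent symbol.
    for child in children:
        if child.relation == "next":
            text += render(child, index)
    return text


def linearize(samples: List[Sample]) -> str:
    """Render every expression whose first symbol has no parent."""
    index = build_index(samples)
    roots = [s for s in samples if s.parent_id is None]
    return "".join(render(root, index) for root in roots)


# Toy data for the expression x^2 + y (not taken from the database).
expression = [
    Sample(1, "x", None, "next"),
    Sample(2, "2", 1, "sup"),
    Sample(3, "+", 1, "next"),
    Sample(4, "y", 3, "next"),
]
linearized = linearize(expression)
print(linearized)
```

Storing one link per sample keeps the character-level view and the expression-level view in the same table, which is why the same data can serve as both a character image database and an expression database.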
2. InftyCDB-2
A Ground Truth Database
of Characters, Symbols, Words and Formulas in Mathematical Documents;
Second Distribution, December 27, 2006
- Description:
This is a continuation of InftyCDB-1,
with the same structure. It contains some documents in German and French,
as well as many in English.
There are 662,142 characters from English articles, 37,439 from French
articles, and 77,812 from German articles. For a complete list
of the articles, see here.
Note that the database was corrected recently; details about the
revisions are here.
- Conditions of use:
Same as InftyCDB-1.
- Availability:
Please complete the
user registration;
you will then receive a URL from which you can download InftyCDB-2.
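To put the per-language figures in the description above into proportion, the following short calculation sums them and derives each language's share. It assumes the three counts together make up the whole of InftyCDB-2, which the description suggests but does not state explicitly.

```python
# Character counts as given in the InftyCDB-2 description.
counts = {"English": 662_142, "French": 37_439, "German": 77_812}

# Assumption: these three figures partition the whole database.
total = sum(counts.values())
shares = {lang: round(100 * n / total, 1) for lang, n in counts.items()}

print(f"total characters: {total:,}")
for lang, share in shares.items():
    print(f"{lang}: {share}%")
```

Under that assumption the total is 777,393 characters, with English accounting for roughly 85% of the data.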
3. InftyCDB-3
A Ground Truth Database
of Characters and Symbols in Mathematical Documents;
Third Distribution, October 24, 2006
- Description:
InftyCDB-3 is a database of single alphanumeric characters and mathematical
symbols, divided into two data sets. Unlike InftyCDB-1 and InftyCDB-2,
word and mathematical expression structure is not included.
The images are of individual characters only.
To make it easy to use for experimentation and development with
single-character recognition engines, symbols whose form is identical
(for example, the summation symbol and the Greek capital sigma) are
assigned the same character code.
In InftyCDB-3-A, there are 188,752 characters; in InftyCDB-3-B, there
are 70,637 characters.
InftyCDB-3-A is the training set used to produce recent versions of
InftyReader (Versions 2.0 - 2.5.0). Taking data from more than 300
sources, we have tried to cover as many varieties of characters and
symbols as possible. The data was extracted from
books and journals from various publishers,
Japanese documents, typesetting samples from printing companies, fonts
installed on personal computer operating systems, and LaTeX fonts.
InftyCDB-3-B is an extract of InftyCDB-1, which includes data from 20 of
its articles. To reduce the number of samples with the same
character code, size, and shape, clustering was applied to the data
from these 20 articles, reducing the number of data samples to
about 70,000. The data is written in the same format as in InftyCDB-3-A.
Please see this explanation
for more details.
- Note:
This data set does not include any German
symbols.
- Conditions of use:
Same as InftyCDB-1.
- Availability:
Please complete the
user registration; you will then receive a URL from which you can download InftyCDB-3.
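The merging of identically shaped symbols described above matters when scoring a single-character recognizer against labels from another source: a prediction of "summation" for a glyph labelled as capital sigma should not be counted as an error. The sketch below shows one way to apply such a merge at evaluation time. The symbol names and the `SAME_SHAPE` table are hypothetical placeholders; the actual character codes are those defined in the InftyCDB-3 documentation.

```python
from typing import Dict, Sequence

# Hypothetical table mapping a symbol name to the name it shares a shape with.
# Only the pair mentioned in the description is listed here.
SAME_SHAPE: Dict[str, str] = {"summation": "Sigma"}


def canonical(label: str) -> str:
    """Map a label to the single code used for its shape."""
    return SAME_SHAPE.get(label, label)


def accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of samples whose shape-level label matches."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    hits = sum(canonical(p) == canonical(t) for p, t in zip(predicted, truth))
    return hits / len(truth)


# Toy labels, not taken from the database.
truth = ["Sigma", "x", "summation", "alpha"]
predicted = ["summation", "x", "Sigma", "beta"]
score = accuracy(predicted, truth)
print(score)
```

Without the canonical mapping, the first and third samples would be scored as errors even though the glyph images are indistinguishable.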
4. InftyMDB-1
A Ground Truth Database of Mathematical Expressions, August 12, 2009
Finding recognition errors is an important task in OCR. This database was prepared for the research and development of algorithms that detect misrecognitions in mathematical OCR.
InftyProject encourages everyone to use the database to evaluate new verification methods.
This database was used in the following paper: A. Fujiyoshi, M. Suzuki, and S. Uchida, "Verification of mathematical formulae based on a combination of context-free grammar and tree grammar," in Proceedings of MKM 2008, LNCS (LNAI) 5144, pp. 415-429, 2008.
The database was constructed by collecting 3,000 correctly recognized mathematical formulae and 1,400 misrecognized mathematical formulae generated by InftyReader.
The formulae were collected from 32 pure mathematics articles, 30 of which are the same articles as in InftyCDB-1.
Each mathematical formula in the database consists of 10 or more symbols.
Original images of formulae are available.
A CSV file containing the corrected results for the misrecognized formulae, that is, their ground-truth data, is also available.
- To download the database file, please click the following:
InftyMDB-1
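One possible way to use the two sets of formulae for evaluating a verification method is to treat the misrecognized formulae as positives and the correctly recognized formulae as negatives, then count how many of each the verifier flags. The sketch below illustrates this under stated assumptions: `flags_error` is a hypothetical stand-in for the verifier under test (here a trivial parenthesis check, not the grammar-based method of the paper cited above), and the toy strings are not taken from InftyMDB-1.

```python
from typing import Callable, Dict, Sequence


def flags_error(formula: str) -> bool:
    """Hypothetical verifier: flag formulae with unbalanced parentheses."""
    depth = 0
    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return True
    return depth != 0


def evaluate(
    verifier: Callable[[str], bool],
    correct: Sequence[str],
    misrecognized: Sequence[str],
) -> Dict[str, float]:
    """Score a verifier on correctly recognized and misrecognized formulae."""
    tp = sum(1 for f in misrecognized if verifier(f))  # errors detected
    fn = len(misrecognized) - tp                       # errors missed
    fp = sum(1 for f in correct if verifier(f))        # false alarms
    tn = len(correct) - fp
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "recall": recall, "precision": precision}


# Toy stand-ins for the two sets in the database.
correct = ["(a+b)^2", "f(x)=x", "a+b"]
misrecognized = ["(a+b^2", "f x)=x", "a+6"]
result = evaluate(flags_error, correct, misrecognized)
print(result)
```

Reporting recall and precision separately is useful here because the two sets differ in size, so a single accuracy figure would be dominated by the correctly recognized formulae.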