Infty Project
Research Project on Mathematical Information Processing
Mathematical Document Recognition and Analysis, User Interface, Accessibility of Scientific Documents

Databases

To promote further research into OCR for scientific documents, the InftyProject releases databases that may be suitable for research outside of InftyProject development. We have carefully scrutinized the data in these releases, and expect that the databases can be relied upon with high confidence. Nevertheless, if you experience any problems with them, please contact us.

1. InftyCDB-1

A Ground Truth Database of Characters, Symbols, words and Formulas in Mathematical Documents; First Distribution, March 18, 2005

  • Description:

    InftyCDB-1 consists of 30 mathematics articles in English. In all, it comprises 688,580 character samples, from 476 pages of text. The image of each alphanumeric character or mathematical symbol is recorded, together with the character code of the symbol it represents. In addition, links are recorded that represent the structure of each word or mathematical expression that appears. Thus, InftyCDB-1 can be used as a word or mathematical expression database, as well as a character image database.

    There are 108,914 words and 21,056 mathematical expressions in InftyCDB-1. For more details about the database, please see here.

  • Conditions of use:

    Usage for research, development, or testing of OCR systems (possibly commercial) for scientific documents is permitted, free of charge.

  • Availability:

    Please complete the user registration. After doing so, you will be sent a URL from which you can download InftyCDB-1.

2. InftyCDB-2

A Ground Truth Database of Characters, Symbols, words and Formulas in Mathematical Documents; Second Distribution, December 27, 2006

  • Description:

    This is a continuation of InftyCDB-1, with the same structure. It contains some documents in German and French, as well as many in English.

    There are 662,142 characters from English articles, 37,439 from French articles, and 77,812 from German articles. For a complete list of the articles, see here. Note that the database was corrected recently; details about the revisions are here.

  • Conditions of use:

    Same as InftyCDB-1.

  • Availability:

    Please complete the user registration. After doing so, you will be sent a URL from which you can download InftyCDB-2.

3. InftyCDB-3

A Ground Truth Database of Characters and Symbols in Mathematical Documents; Third Distribution, October 24, 2006

  • Description:

    InftyCDB-3 is a database of single alphanumeric characters and mathematical symbols, divided into two data sets. Unlike InftyCDB-1 and InftyCDB-2, word and mathematical expression structure is not included. The images are of individual characters only. To make it easy to use for experimentation and development with single-character recognition engines, symbols whose form is identical (for example, the summation symbol and the Greek capital sigma) are assigned the same character code.

    In InftyCDB-3-A, there are 188,752 characters; in InftyCDB-3-B, there are 70,637 characters.

  • InftyCDB-3-A is the training set used to produce recent versions of InftyReader (Versions 2.0 - 2.5.0). Taking data from more than 300 sources, we have tried to cover as many varieties of characters and symbols as possible. The data was extracted from books and journals from various publishers, Japanese documents, typesetting samples from printing companies, fonts installed on personal computer operating systems, and LaTeX fonts.

  • InftyCDB-3-B is an extract of InftyCDB-1, which includes data from 20 of its articles. To reduce the number of samples with the same character code, size, and shape, clustering was applied to the data from these 20 articles, reducing the number of data samples to about 70,000. The data is written in the same format as in InftyCDB-3-A.

    Please see this explanation for more details.

  • Note:

    This data set does not include any German symbols.

  • Conditions of use:

    Same as InftyCDB-1.

  • Availability:

    Please complete the user registration. After doing so, you will be sent a URL from which you can download InftyCDB-3.
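
The InftyCDB-3 description notes that symbols with identical form, such as the summation symbol and the Greek capital sigma, share one character code. Anyone comparing labels from InftyCDB-1 or InftyCDB-2 against InftyCDB-3 might apply a similar merge first. The following is a minimal Python sketch of that idea only. The label names (`"summation"`, `"Sigma"`, `"alpha"`) and the functions `merge_label` and `merged_counts` are hypothetical and do not reflect the actual codes or file format of the databases.

```python
from collections import Counter

# Hypothetical label names for illustration; these are NOT actual InftyCDB codes.
# Each key is mapped to the label it shares a printed form with.
SHAPE_EQUIVALENTS = {
    "summation": "Sigma",  # the summation sign has the same form as capital sigma
}


def merge_label(label: str) -> str:
    """Return the shared label for symbols whose printed form is identical."""
    return SHAPE_EQUIVALENTS.get(label, label)


def merged_counts(labels):
    """Count samples per class after merging shape-identical labels."""
    return Counter(merge_label(label) for label in labels)


# Toy sample list (assumed data, not taken from the databases).
samples = ["Sigma", "summation", "alpha", "summation"]
counts = merged_counts(samples)
```

With this toy input, the two labels are counted as a single class, which is the behavior a single-character recognizer would be evaluated against under the merged coding.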

