BIOS: Suite of Syntactico-Semantic Analyzers


Bios is a suite of syntactico-semantic analyzers that includes the most common tools needed for the shallow analysis of English text. The following tools are currently included:

  • Smart tokenizer that recognizes abbreviations, SGML tags, etc.

  • Part-of-speech (POS) tagger. The POS tagger is implemented as a wrapper around the TnT tagger by Thorsten Brants.

  • Syntactic chunking using the labels promoted by the CoNLL chunking evaluations.

  • Named-Entity Recognition and Classification (NERC) for the CoNLL entity types plus an additional 11 numerical entity types.
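To make the chunk labels concrete, here is a minimal sketch of how CoNLL-style tags can be turned back into chunks. It assumes the chunker output follows the CoNLL-2000 convention, where B-X opens a chunk of type X, I-X continues it, and O marks a token outside any chunk. The class BioChunkDecoder and its decode method are hypothetical names used for illustration only; they are not part of the Bios API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: BioChunkDecoder is a hypothetical helper, not part of the Bios API.
final class BioChunkDecoder {

    // Groups tokens into labeled chunks from CoNLL-style B-X / I-X / O tags.
    static List<String> decode(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null; // text of the chunk being built, or null if none is open
        String currentType = null;    // label of the open chunk, e.g. "NP"

        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            boolean begins = tag.startsWith("B-");
            boolean continues = tag.startsWith("I-")
                    && current != null
                    && tag.substring(2).equals(currentType);

            if (continues) {
                current.append(' ').append(tokens[i]);
                continue;
            }
            // Any tag that does not continue the open chunk closes it.
            if (current != null) {
                chunks.add("[" + currentType + " " + current + "]");
                current = null;
                currentType = null;
            }
            // B-X, or a stray I-X with no matching open chunk, starts a new chunk.
            if (begins || tag.startsWith("I-")) {
                currentType = tag.substring(2);
                current = new StringBuilder(tokens[i]);
            }
        }
        if (current != null) {
            chunks.add("[" + currentType + " " + current + "]");
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "reckons", "the", "current", "account", "deficit", "will", "narrow"};
        String[] tags = {"B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP"};
        for (String chunk : decode(tokens, tags)) {
            System.out.println(chunk);
        }
    }
}
```

Tokens tagged O are simply dropped from the result. The same decoding applies to named-entity labels if they are emitted in the same B-/I-/O scheme, which is also an assumption here.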

Why should you use this software? There are at least four reasons:

  1. You can configure it for very high accuracy but slower execution (using YamCha) or for high speed and slightly lower accuracy (using my own implementation of an asymmetric Perceptron). Note: Maximum Entropy (ME) is also supported, but the ME models are not included in this package because they are both less accurate and slower than the Perceptron. See the project NEWS file for performance numbers using YamCha and the Perceptron.

  2. It has built-in models for both case-sensitive and case-insensitive text.

  3. You can retrain Bios on your own corpus and label set.

  4. It has a clean, easy-to-use Java API.