Natural Language Toolkit

...software, data sets and tutorials for natural language processing...

Development

Known bugs
  • certain re-entrant structures in featstruct
  • when the LHS of an edge contains an ApplicationExpression, variable values in the RHS bindings aren't copied over when the fundamental rule applies
  • HMM tagger tags everything as ' in some situations
Misc tasks for contributors
  • Using doctest, write regression test cases for some of the NLTK modules that are currently lacking regression testing.
  • Check the epydoc API documentation strings to make sure they're still up-to-date and accurate, and fill in missing documentation strings where necessary.

Version 0.9.4 (July)

  • put logic branch back in trunk
  • move rst package into core nltk
  • replace ConditionalFreqDist with defaultdict(FreqDist)
  • change FreqDist to wrap a dictionary and pass through: getitem, iter, contains, len
  • add plot method to FreqDist
  • add doc_contrib to top level Makefile
  • epydoc display of docstrings in treetransforms.py
  • doctest code in function docstrings?
  • improve installation instructions on CD-ROM
  • nltk.tree.Tree documentation (including node attribute)
  • access to SemCor frequency data in the WordNet API
  • interface to the WordNet index
  • provide access to WordNet senses, so that we can navigate from a synset to its component word senses (not words), as requested by Tim Mahrt
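
The two FreqDist items above (wrapping a dictionary, and replacing ConditionalFreqDist with defaultdict(FreqDist)) could look roughly like the following. This is a minimal sketch, not the planned implementation: the internal attribute name _counts, the inc method signature, and the choice to return 0 for unseen samples are all assumptions made for illustration.

```python
from collections import defaultdict


class FreqDist:
    """Hypothetical sketch: a frequency distribution that wraps a plain dict."""

    def __init__(self, samples=()):
        self._counts = {}  # assumed internal storage: sample -> count
        for sample in samples:
            self.inc(sample)

    def inc(self, sample, count=1):
        # Record `count` more occurrences of `sample`.
        self._counts[sample] = self._counts.get(sample, 0) + count

    # Pass-through methods named in the task list.
    def __getitem__(self, sample):
        # Assumption: unseen samples have count 0 instead of raising KeyError.
        return self._counts.get(sample, 0)

    def __iter__(self):
        return iter(self._counts)

    def __contains__(self, sample):
        return sample in self._counts

    def __len__(self):
        return len(self._counts)


# A conditional frequency distribution as a defaultdict of FreqDist objects:
# each new condition gets an empty FreqDist on first access.
cfd = defaultdict(FreqDist)
for condition, sample in [("news", "the"), ("news", "the"), ("fiction", "she")]:
    cfd[condition].inc(sample)
```

With this shape, cfd["news"]["the"] reads naturally as "the count of 'the' given the condition 'news'", and no separate ConditionalFreqDist class is needed for the basic case.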

Version 0.9.5 (August)

  • good off-the-shelf tokenizer, tagger and chunker
  • web-as-corpus interface with caching
  • add OLAC records for each corpus

NLTK-Lite Version 1.0 / NLTK Version 2.0 (late 2008)

Once it reaches version 1.0, NLTK-Lite's name will be changed back to NLTK and assigned version 2.0. This will coincide with the publication of the NLTK Book. From this point onwards, names and interfaces will be frozen for at least a year. Subsequent changes will be conservative and will support backwards compatibility wherever possible.

  • corpora hosted in external archive
  • off-the-shelf components/models as part of distribution (cf. punkt)
    • resolution of how models are built and distributed
    • possible core-data as part of single nltk distro, vs larger repository for many corpora
  • single download per platform

Unscheduled tasks

  • material on writing (adapting?) a corpus reader
  • Simple n-gram language modeling, interpolated and backoff language models
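
As an illustration of the n-gram language modeling item, here is a minimal sketch of a bigram model with linear interpolation against unigram estimates. It does not use any NLTK API; the function names and the fixed weight lam are assumptions chosen for the example, and a backoff model is not shown.

```python
from collections import Counter


def train_bigram_model(tokens):
    """Collect the counts needed for an interpolated bigram model."""
    unigrams = Counter(tokens)                  # word -> count
    contexts = Counter(tokens[:-1])             # word -> count as a bigram context
    bigrams = Counter(zip(tokens, tokens[1:]))  # (prev, word) -> count
    return unigrams, contexts, bigrams


def interpolated_prob(word, prev, model, lam=0.7):
    """P(word | prev) = lam * P_bigram + (1 - lam) * P_unigram."""
    unigrams, contexts, bigrams = model
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    # If `prev` was never seen as a context, the bigram term contributes 0.
    p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni


model = train_bigram_model("the cat sat on the mat".split())
p = interpolated_prob("cat", "the", model, lam=0.5)
```

In practice the interpolation weight would be tuned on held-out data rather than fixed by hand.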

Software

  • Marshalling
  • integrate more student projects (incl TAG, textcat, paradigms)
  • add sequence values to FeatureStructure
  • decision list classifier
  • collocation support (chi-square, PMI, Spearman rank correlation, etc.)
  • new material on data modelling (interlinear text, paradigms)
  • lexical semantics
  • information extraction (e.g. from biomedical literature)
  • regular expressions for extracting temporal expressions
  • terminological difference in chart parsing with Jurafsky and Martin textbook
  • Text Tiling
  • WordNet similarity: Gloss Vector similarity

Corpora

  • more LDC corpus samples (Fisher?)
  • SRL corpus and reader
  • more tree data
  • MUC 6 or 7 data
  • MWE corpus (Nicholson)
  • Mawu corpus sample
  • Yemba lexicon

Housekeeping

  • Unicode compliance
  • check graphical demos on Windows machines (add cf.mainloop()?)