Development
- nltk-devel - the mailing list where development plans are discussed
- Feature Requests - new functionality we'd like to add
- Bugs - problems we'd like to fix
- Development Projects - ongoing projects by members of the NLTK community
- Developers Guide - how we collaborate
- Eclipse - how to set up an IDE for NLTK development
- Projects - suggested projects
- Known bugs
  - certain re-entrant structures in featstruct
  - when the LHS of an edge contains an ApplicationExpression, variable values in the RHS bindings aren't copied over when the fundamental rule applies
  - HMM tagger tags everything as ' in some situations
- Misc tasks for contributors
  - Using doctest, write regression test cases for some of the NLTK modules that are currently lacking regression testing.
  - Check the epydoc API documentation strings to make sure they're still up-to-date and accurate, and where necessary, fill in missing documentation strings.
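As an illustration of the doctest task above, a regression test can live directly in a function's docstring. The sketch below uses a hypothetical helper, count_tokens, which is not part of NLTK; the same pattern should carry over to real NLTK functions.

```python
import doctest


def count_tokens(text):
    """Count whitespace-separated tokens (hypothetical helper, not an NLTK function).

    >>> count_tokens("the cat sat")
    3
    >>> count_tokens("")
    0
    """
    return len(text.split())


def run_regression_tests():
    """Run the docstring examples of count_tokens and return the doctest results."""
    parser = doctest.DocTestParser()
    # Build a DocTest object from the docstring, exposing only the name it needs.
    test = parser.get_doctest(
        count_tokens.__doc__, {"count_tokens": count_tokens}, "count_tokens", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)
```

Calling run_regression_tests() returns a result with failed and attempted counts, so a failing example shows up as a non-zero failed value.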
Version 0.9.4 (July)
- put logic branch back in trunk
- move rst package into core nltk
- replace ConditionalFreqDist with defaultdict(FreqDist)
- change FreqDist to wrap a dictionary and pass through: getitem, iter, contains, len
- add plot method to FreqDist
- add doc_contrib to top level Makefile
- epydoc display of docstrings in treetransforms.py
- doctest code in function docstrings?
- improve installation instructions on CD-ROM
- nltk.tree.Tree documentation (including node attribute)
- access to SemCor frequency data in the WordNet API
- interface to the WordNet index
- provide access to WordNet senses, so that we can navigate from a synset to its component word senses (not words), as requested by Tim Mahrt
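The two FreqDist items in this list (wrapping a dictionary, and replacing ConditionalFreqDist with defaultdict(FreqDist)) can be sketched roughly as follows. SketchFreqDist, its inc method, and conditional_counts are hypothetical names used for illustration only; this is not the actual NLTK implementation.

```python
from collections import defaultdict


class SketchFreqDist:
    """Hypothetical stand-in for FreqDist: wraps a dict of sample counts."""

    def __init__(self, samples=()):
        self._counts = {}
        for sample in samples:
            self.inc(sample)

    def inc(self, sample, count=1):
        """Increase the count for a sample (method name is an assumption)."""
        self._counts[sample] = self._counts.get(sample, 0) + count

    # The four operations passed through to the wrapped dictionary.
    def __getitem__(self, sample):
        # Unseen samples report a count of 0 without being inserted.
        return self._counts.get(sample, 0)

    def __iter__(self):
        return iter(self._counts)

    def __contains__(self, sample):
        return sample in self._counts

    def __len__(self):
        return len(self._counts)


def conditional_counts(pairs):
    """Build per-condition distributions from (condition, sample) pairs."""
    cfd = defaultdict(SketchFreqDist)
    for condition, sample in pairs:
        cfd[condition].inc(sample)
    return cfd
```

With this shape, cfd[condition][sample] reads a count, and a new condition gets an empty distribution on first access, which is the behaviour a dedicated conditional class would otherwise have to provide.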
Version 0.9.5 (August)
- good off-the-shelf tokenizer, tagger and chunker
- web-as-corpus interface with caching
- add OLAC records for each corpus
NLTK-Lite Version 1.0 / NLTK Version 2.0 (late 2008)
Once it reaches version 1.0, NLTK-Lite's name will be changed back to NLTK and assigned version 2.0. This will coincide with the publication of the NLTK Book. From this point onwards, names and interfaces will be frozen for at least a year. Subsequent changes will be conservative and will support backwards compatibility wherever possible.
- corpora hosted in external archive
- off-the-shelf components/models as part of distribution (cf. punkt)
- resolution of how models are built and distributed
- possible core-data as part of single nltk distro, vs larger repository for many corpora
- single download per platform
Unscheduled tasks
- material on writing (adapting?) a corpus reader
- Simple n-gram language modeling, interpolated and backoff language models
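To make the n-gram item concrete, here is a minimal sketch of a linearly interpolated bigram model. The function names and the fixed weight lam are assumptions for illustration; a real implementation would tune the weight on held-out data and handle unseen words.

```python
from collections import Counter


def train_bigram_counts(tokens):
    """Collect unigram, context, and bigram counts from a token list."""
    unigrams = Counter(tokens)
    # Contexts are tokens that have a successor, i.e. everything but the last.
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, contexts, bigrams


def interpolated_prob(word, prev, unigrams, contexts, bigrams, lam=0.7):
    """Return lam * P(word | prev) + (1 - lam) * P(word), using raw counts."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni
```

A backoff model would instead use the bigram estimate when the bigram was observed and fall back to a discounted unigram estimate otherwise.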
Software
- Marshalling
- integrate more student projects (incl TAG, textcat, paradigms)
- add sequence values to FeatureStructure
- decision list classifier
- collocation support (chi-sq, PMI, spearman rank correlation, etc)
- new material on data modelling (interlinear text, paradigms)
- lexical semantics
- information extraction (e.g. from biomedical literature)
- regular expressions for extracting temporal expressions
- terminological difference in chart parsing with Jurafsky and Martin textbook
- Text Tiling
- WordNet similarity: Gloss Vector similarity
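For the collocation item in this list, pointwise mutual information (PMI) over adjacent word pairs could be computed along these lines. bigram_pmi is a hypothetical function name, and the sketch uses unsmoothed maximum-likelihood estimates, so rare pairs will score unrealistically high.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI = log2( P(x, y) / (P(x) * P(y)) )
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

In practice a minimum frequency threshold is usually applied before ranking pairs by PMI.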
Corpora
- more LDC corpus samples (Fisher?)
- SRL corpus and reader
- more tree data
- MUC 6 or 7 data
- MWE corpus (Nicholson)
- Mawu corpus sample
- Yemba lexicon
Housekeeping
- Unicode compliance
- check graphical demos on Windows machines (add cf.mainloop()?)



