Natural Language Toolkit

...software, data sets and tutorials for natural language processing...

Projects

 


This page describes a variety of possible natural language processing projects that can be undertaken using NLTK. Several past projects are now a core part of NLTK. Please contact the NLTK team to suggest other project ideas. Note that many smaller programming tasks are described in the textbook exercises.

As far as possible, code that is developed in these projects should build on existing NLTK modules, especially the interface classes and APIs. In general, you should ensure that you complement your code with appropriate testing data. You are strongly encouraged to write a doctest document which will both explain the functionality of your code to users and also provide unit tests, in the style adopted in NLTK Guides.
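A minimal sketch of the doctest style, assuming a hypothetical helper called strip_plural (it is not part of NLTK and is deliberately naive). The examples in the docstring serve as both user documentation and unit tests:

```python
import doctest


def strip_plural(word):
    """Remove a trailing "s" from a word (hypothetical, deliberately naive).

    Words that end in "ss" or that are very short are left unchanged.

    >>> strip_plural("cats")
    'cat'
    >>> strip_plural("glass")
    'glass'
    >>> strip_plural("is")
    'is'
    """
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


if __name__ == "__main__":
    # Run every example found in this module's docstrings and
    # print a summary of how many were attempted and how many failed.
    print(doctest.testmod())
```

The same examples could instead live in a standalone text file alongside explanatory prose and be run with doctest.testfile, which is closer to a separate doctest document.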

The distinction between Computationally Oriented and Linguistically Oriented Projects is not hard and fast, since every project should have some mix of both aspects.

Computationally Oriented Projects

  1. Port the Snowball stemmers to NLTK.
  2. Build a compiler for finite state transducers (cf Xerox's xfst or Gertjan van Noord's FSA Utilities).
  3. Develop a maximum-entropy POS tagger for NLTK (e.g. see MXPOST).
  4. Provide a framework that will make it easy to run NLTK processors over very large data sets using the Hadoop implementation of MapReduce.
  5. Develop a chunker that uses transformation-based learning, adapting NLTK's Brill Tagger to chunk tags (see Ramshaw 1995).
  6. Develop a lexical-chain based WSD system, using the similarity measures defined on WordNet, and evaluate it using the SEMCOR corpus (corpus reader provided in NLTK).
  7. Build a system for aligning words and / or sentences in parallel corpora (see [1] for some starting points).
  8. Implement a dependency parser (cf [2]).
  9. Create a natural language generator (surface realizer component only) based on the FUF/SURGE system.
  10. Build and train a statistical Named Entity Recognizer for MUC-type entities (e.g., person, location, organisation, cardinal, duration, measure, date).
  11. Implement a chatbot that incorporates a more sophisticated dialogue model than nltk.chat.eliza.
  12. Build a chatbot that adapts machine-translation technology to map from an input utterance (source language) to an appropriate response (target language); for example, by using word alignments, "translation" probabilities and language models.
  13. Build an extensible state-based dialogue manager.
  14. Implement a categorial grammar parser, including semantic representations, cf nltk_contrib.lambek.
  15. Develop a prepositional phrase attachment classifier, using the ppattach corpus for training and testing.
  16. Taking the VerbOcean data which captures semantic relationships between verbs [3], generate a semantic network of verb relationships and implement a tree traversal algorithm that can calculate the similarity between two verbs, e.g. "fly" and "crash". You can find a demo of this system at [4].
  17. News stories from different sources often contain contradictory information about a particular event, such as the number of people killed in an earthquake. Build a numerical expression recogniser and resolver that can identify equality and contradiction between numerical expressions such as: "5 adults" != "3 children and 2 adults", but "5 people" = "3 children and 2 adults".
  18. Develop a system for encoding lexicons that can be incorporated into existing NLTK code for parsing feature-based grammars (cf the treatment of lexicon files in PC-PATR). Ideally this should include readers / writers for a number of existing lexical formats as well as the creation of new lexicons.
  19. Build a GUI-based grammar development environment that will help users identify and fix bugs in their grammars.
  20. Build a replacement for nltk.sem.logic which will parse standard First Order Logic formulas, while still supporting lambda abstraction, and alpha- and beta-conversion.
  21. Re-implement any NLTK functionality for a language other than English (tokenizer, tagger, chunker, parser, etc). You will probably need to collect suitable corpora, and develop corpus readers.
  22. Use the names corpus to identify first names that are not ambiguous for gender, then build a gender classification system for newsgroup postings (e.g. the 20-Newsgroups Corpus [5]), to test the hypothesis that males and females use language differently.
  23. Write a program to generate referring expressions: assume a collection of entities having attributes for shape, color, size, etc, then generate a noun phrase that mentions enough attributes to uniquely identify the intended entity (e.g. "the small green book").
  24. Build a text classification system for one of the classified corpora included with NLTK (movie_reviews, qc, reuters), using or extending the nltk.classify package.
  25. Implement the TextTiling algorithm for segmenting text [6].
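As an illustration of project 23 above (generating referring expressions), here is a small sketch under stated assumptions. It uses a simplified greedy strategy: try attributes in a fixed order and keep each one that rules out at least one competing entity. The entity representation (dicts with "kind", "size" and "color" keys), the function name describe, and the attribute order are all hypothetical choices for this example, not part of NLTK.

```python
def describe(target, entities, attribute_order=("color", "size")):
    """Return a noun phrase that singles out `target`, or None if impossible.

    Assumes each entity is a dict with a "kind" key (the head noun) and
    optional attribute keys such as "color" and "size".
    """
    # The head noun is always mentioned, so only entities of the same
    # kind can be confused with the target.
    distractors = [e for e in entities
                   if e is not target and e["kind"] == target["kind"]]
    chosen = {}
    for attr in attribute_order:
        if not distractors:
            break
        if attr not in target:
            continue
        remaining = [e for e in distractors if e.get(attr) == target[attr]]
        # Keep the attribute only if it removes at least one distractor.
        if len(remaining) < len(distractors):
            chosen[attr] = target[attr]
            distractors = remaining
    if distractors:
        return None  # the available attributes cannot identify the target
    # Render adjectives in the usual English order: size before color.
    adjectives = [chosen[a] for a in ("size", "color") if a in chosen]
    return " ".join(["the"] + adjectives + [target["kind"]])


scene = [
    {"kind": "book", "size": "small", "color": "green"},
    {"kind": "book", "size": "large", "color": "green"},
    {"kind": "book", "size": "small", "color": "red"},
    {"kind": "cup", "color": "green"},
]

print(describe(scene[0], scene))  # the small green book
print(describe(scene[2], scene))  # the red book
print(describe(scene[3], scene))  # the cup
```

Because attributes are tried in a fixed order and never retracted, the result is not always the shortest possible description: for the large green book this sketch produces "the large green book" even though "the large book" would suffice. A fuller project would need to address that trade-off.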

Linguistically Oriented Projects

  1. Develop a morphological analyser for a language of your choice.
  2. Develop a non-trivial grammar fragment that can be parsed with nltk.parse.featurechart.
  3. Develop a coreference resolution system (cf LingPipe) or an anaphora resolution system (cf MARS).
  4. Build a shallow discourse parser, which takes chunked or parsed sentences as input and yields a discourse structure as output, cf SPADE, the Penn Discourse Treebank (PDTB), Prasad et al., Dinesh et al.
  5. Write a soundex function that is appropriate for a language you are interested in. If the language has clusters (consonants or vowels), consider how reliably people can discriminate the second and subsequent member of a cluster. If these are highly confusable, ignore them in the signature. If the *order* of segments in a cluster leads to confusion, normalize this in the signature (e.g. sort each cluster alphabetically, so that a word like treatments would be normalized to rtaemtenst, before the code is computed).
  6. Develop a text classification system which efficiently classifies documents in two or three closely related languages. Consider which features discriminate between the languages despite their apparent similarity. The implementation should be evaluated on unseen data.
  7. Explore the phonotactic system of a language you are interested in. Compare your findings to a published phonological or grammatical description of the same language.
  8. Implement a structured text rendering module which takes linguistic data from a source such as Shoebox and generates an XML-based lexicon or interlinear text, based on user preferences for field exports.
  9. Develop a grammatical paradigm generation function which takes some form of tagged text as input and generates paradigm representations of related linguistic features.
  10. Develop an automatic essay assessment tool, cf [7].
  11. Build character n-gram models for different languages using the UDHR corpus (included with NLTK), and use these to generate hypothetical proper names in these languages (cf Pywordgen).
  12. Develop a program for unsupervised learning of phonological rules, using the method described by Goldwater and Johnson [8].
  13. Use WordNet to infer lexical semantic relationships on the entries of a Shoebox lexicon for some arbitrary language.
  14. Develop support for competitive grammar writing, cf [9].
  15. Implement a TGrep2 interpreter for querying treebanks [10].
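The cluster normalization step suggested in project 5 above can be sketched as follows. This is only the pre-processing step, not a full soundex code, and it assumes a plain five-letter vowel inventory ("aeiou") with every other character treated as a consonant. A real project would replace that with the segment inventory of the chosen language. The function name normalize_clusters is hypothetical.

```python
from itertools import groupby

# Assumption: a simple orthographic vowel set; adjust for the target language.
VOWELS = frozenset("aeiou")


def normalize_clusters(word, vowels=VOWELS):
    """Sort the letters inside each vowel cluster and each consonant cluster.

    Consecutive characters of the same class (vowel or non-vowel) form a
    cluster; the order of clusters is preserved, only the order of the
    segments inside each cluster is normalized.
    """
    word = word.lower()
    clusters = groupby(word, key=lambda ch: ch in vowels)
    return "".join("".join(sorted(cluster)) for _, cluster in clusters)


print(normalize_clusters("treatments"))  # rtaemtenst
```

Grouping by vowel versus non-vowel keeps the syllable skeleton intact, so two spellings that differ only in the order of segments within a cluster map to the same string before the signature is computed.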

Other Sources of Ideas for NLTK Projects
