""" I would like to move towards a view-based data type for accessing corpora, instead of the current iterator-based approach. This module defines a general class for defining stream-backed corpus views. It's a rough draft and most likely still contains a bug or two. This module also includes a few example corpus view factory functions, based on the corpus view class, for accessing the brown corpus. See each function's details for example uses. """ import bisect class StreamBackedCorpusView: """ A base class for defining a 'view' for a corpus file. A C{StreamBackedCorpusView} object acts like a sequence of tokens: it can be accessed by index, iterated over, etc. However, the tokens are only constructed as-needed. The constructor to C{StreamBackedCorpusView} takes two arguments: a corpus file (which can be either a filename or a stream); and a block tokenizer. A X{block tokenizer} is a function that reads and tokenizes zero or more tokens from a stream, and returns them as a list. A very simple example of a block tokenizer is: >>> def simple_block_tokenizer(stream): ... return stream.readline().split() This simple block tokenizer reads a single line at a time, and returns a single token (consisting of a string) for each whitespace-separated substring on the line. When deciding how to define the block tokenizer for a given corpus, careful consideration should be given to the size of blocks handled by the block tokenizer. Smaller block sizes will increase the memory requirements of the corpus view's internal data structures (by 2 integers per block). On the other hand, larger block sizes may decrase performance for random access to the corpus. (But note that larger block sizes will I{not} decrease performance for iteration.) Internally, C{CorpusView} maintains a partial mapping from token index to file position, with one entry per block. When a token with a given index M{i} is requested, the C{CorpusView} constructs it as follows: 1. 
First, it searches the toknum/filepos mapping for the token index closest to (but less than or equal to) M{i}. 2. Then, starting at the file position corresponding to that index, it tokenizes one block at a time using the block tokenizer, until it reaches the requested token. The toknum/filepos mapping is created lazily: it is initially empty, but every time a new block is tokenized, the block's initial token is added to the mapping. (Thus, the toknum/filepos map has one entry per block.) In order to increase efficiency for random access patterns that have high degrees of locality, the corpus view may cache one or more tokenized blocks. @note: Each C{CorpusView} object internally maintains an open file object for its underlying corpus file. This file should be automatically closed when the C{CorpusView} is garbage collected, but if you wish to close it manually, use the L{close()} method. If you access a C{CorpusView}'s items after it has been closed, the file object will be automatically re-opened. @warning: If the contents of the file are modified during the lifetime of the C{CorpusView}, then the C{CorpusView}'s beahvior is undefined. @ivar _block_tokenizer: The function used to read and tokenize a single block from the underlying file stream. @ivar _toknum: A list containing the token index of each block that has been process. In particular, C{_toknum[i]} is the token index of the first token in block C{i}. Together with L{_filepos}, this forms a partial mapping between token indices and file positions. @ivar _filepos: A list containing the file position of each block that has been process. In particular, C{_toknum[i]} is the file position of the first character in block C{i}. Together with L{_toknum}, this forms a partial mapping between token indices and file positions. @ivar _stream: The stream used to access the underlying corpus file. @ivar _len: The total number of tokens in the corpus, if known; or C{None}, if the number of tokens is not yet known. 
    @ivar _eofpos: The file position of the end of the file.  This is
        calculated when the corpus view is initialized, and is used to
        decide when the end of file has been reached.
    @ivar _cache: A cache of the most recently tokenized block.  It
        is encoded as a tuple (start_toknum, end_toknum, tokens), where
        start_toknum is the token index of the first token in the
        block; end_toknum is the token index of the first token not in
        the block; and tokens is a list of the tokens in the block.
    """
    def __init__(self, corpus_file, block_tokenizer):
        """
        Create a new corpus view, based on the file C{corpus_file}, and
        tokenized with C{block_tokenizer}.  See the class documentation
        for more information.
        """
        self._block_tokenizer = block_tokenizer
        # Initialize our toknum/filepos mapping.
        self._toknum = [0]
        self._filepos = [0]
        # We don't know our length (number of tokens) yet.
        self._len = None
        # Initialize our input stream.
        if isinstance(corpus_file, basestring):
            self._stream = open(corpus_file, 'r')
        else:
            self._stream = corpus_file
        # Find the file position of the end of the file.
        self._stream.seek(0, 2)
        self._eofpos = self._stream.tell()
        # Maintain a cache of the most recently tokenized block, to
        # increase efficiency of random access.
        self._cache = (-1, -1, None)

    def close(self):
        """
        Close the file stream associated with this corpus view.  This
        can be useful if you are worried about running out of file
        handles (although the stream should automatically be closed
        upon garbage collection of the corpus view).  The corpus view
        should not be used after it is closed -- doing so will raise
        an exception (C{ValueError} for built-in file objects).
        """
        self._stream.close()

    def __len__(self):
        """
        Return the number of tokens in the corpus file underlying
        this corpus view.
""" if self._len is None: # _iterate_from() sets self._len when it reaches the end # of the file: for tok in self._iterate_from(self._toknum[-1]): pass return self._len def __getitem__(self, i): """ Return the C{i}th token in the corpus file underlying this corpus view. Negative indices and spans are both supported. """ if isinstance(i, slice): start, stop = i.start, i.stop # Handle negative indices if start < 0: start = max(0, len(self)+start) if stop < 0: stop = max(0, len(self)+stop) # Check if it's n the cache. offset = self._cache[0] if offset <= start and end < self._cache[1]: return self._cache[2][start-offset:end-offset] # Construct & return the result. result = [] for i,tok in enumerate(self._iterate_from(start)): if i+start >= stop: return result result.append(tok) return result else: # Handle negative indices if i < 0: i = max(0, len(self)+i) # Check if it's in the cache. offset = self._cache[0] if offset <= i < self._cache[1]: #print 'using cache', self._cache[:2] return self._cache[2][i-offset] # Use _iterate_from to extract it. try: return self._iterate_from(i).next() except StopIteration: raise KeyError(i) def __iter__(self): """ Return an iterator that generates the tokens in the corpus file underlying this corpus view. """ return self._iterate_from(0) # If we wanted to be thread-safe, then this method would need to # do some locking. def _iterate_from(self, start_tok): """ Return an iterator that generates the tokens in the corpus file underlying this corpus view, starting at the token number C{start}. If C{start>=len(self)}, then this iterator will generate no tokens. """ # Decide where in the file we should start. If `start` is in # our mapping, then we can jump streight to the correct block; # otherwise, start at the last block we've processed. 
        if start_tok < self._toknum[-1]:
            i = bisect.bisect_right(self._toknum, start_tok)-1
            toknum = self._toknum[i]
            filepos = self._filepos[i]
        else:
            toknum = self._toknum[-1]
            filepos = self._filepos[-1]

        # Each iteration through this loop, we tokenize a single block
        # from the stream.
        while True:
            # Tokenize the next block, and record where it ends.  We
            # call tell() now, rather than after the yields below,
            # because other code may move the stream in between.
            self._stream.seek(filepos)
            tokens = self._block_tokenizer(self._stream)
            filepos = self._stream.tell()
            num_toks = len(tokens)
            # Update our cache.
            self._cache = (toknum, toknum+num_toks, tokens)
            # Update our mapping.
            if num_toks+toknum > self._toknum[-1]:
                assert num_toks > 0 # is this always true here?
                self._filepos.append(filepos)
                self._toknum.append(toknum+num_toks)
            # Generate the tokens in this block (but skip any tokens
            # before start_tok).  Note that between yields, our state
            # may be modified.
            for tok in tokens[max(0, start_tok-toknum):]:
                yield tok
            # Update our indices
            toknum += num_toks
            # If we're at the end of the file, then we're done; set
            # our length and terminate the generator.  (toknum is now
            # the index one past the last token, i.e. the token count.)
            if filepos == self._eofpos:
                self._len = toknum
                return

    _MAX_REPR_SIZE = 60
    def __repr__(self):
        """
        @return: A string representation for this corpus view.  The
        representation is similar to a list's representation; but if
        it would be more than 60 characters long, it is truncated.
        """
        pieces = []
        length = 5
        for tok in self:
            pieces.append(repr(tok))
            length += len(pieces[-1]) + 2
            if length > self._MAX_REPR_SIZE and len(pieces) > 2:
                return '[%s, ...]' % ', '.join(pieces[:-1])
        else:
            return '[%s]' % ', '.join(pieces)

######################################################################
#{ Corpus View Factories
######################################################################

def brown_corpus_by_word(filename):
    """
    Provides access to the Brown corpus as a sequence of C{(text, tag)}
    tuples (one per word).  Example:

        >>> cp20 = brown_corpus_by_word('brown/cp20')
        >>> print cp20
        [('I', 'ppss'), ('was', 'bedz'), ('slowly', 'rb'), ...]
        >>> cp20[33] # 34th word.
        ('knew', 'vbd')
        >>> len(cp20) # Number of words
        2429
    """
    return StreamBackedCorpusView(filename, _tokenize_tagged_block_by_word)

def brown_corpus_by_sent(filename):
    """
    Provides access to the Brown corpus as a sequence of lists (one
    per sentence) of C{(text, tag)} tuples (one per word).  Example:

        >>> cp20 = brown_corpus_by_sent('brown/cp20')
        >>> cp20[2][4] # 5th word of 3rd sentence.
        ('were', 'bed')
        >>> len(cp20) # Number of sentences
        183
    """
    return StreamBackedCorpusView(filename, _tokenize_tagged_block_by_sent)

def brown_corpus_by_para(filename):
    """
    Provides access to the Brown corpus as a sequence of lists (one
    per paragraph) of lists (one per sentence) of C{(text, tag)}
    tuples (one per word).  Example:

        >>> cp20 = brown_corpus_by_para('brown/cp20')
        >>> print cp20[2][4][8] # 9th word of 5th sentence of 3rd paragraph
        ('to', 'to')
        >>> len(cp20) # Number of paragraphs
        62
    """
    return StreamBackedCorpusView(filename, _tokenize_tagged_block_by_para)

######################################################################
#{ Block Tokenizers
######################################################################

def _tokenize_tagged_block_by_word(stream):
    """Block tokenizer for L{brown_corpus_by_word}"""
    return [tagged2tuple(tok) for tok in stream.readline().split()]

def _tokenize_tagged_block_by_sent(stream):
    """Block tokenizer for L{brown_corpus_by_sent}"""
    line = stream.readline().strip()
    if not line: return []
    return [[tagged2tuple(tok) for tok in line.split()]]

def _tokenize_tagged_block_by_para(stream):
    """Block tokenizer for L{brown_corpus_by_para}"""
    para = []
    while True:
        line = stream.readline().strip()
        if line:
            para.append([tagged2tuple(tok) for tok in line.split()])
        elif para:
            return [para]
        else:
            return []

######################################################################
#{ Helpers
######################################################################

def tagged2tuple(tok):
    """C{'text/tag'} S{->} C{('text', 'tag')}"""
    pieces = tok.split('/', 1)
    if len(pieces) == 2:
        return (pieces[0], pieces[1])
    else:
        return (pieces[0], None)

######################################################################
#{ Testing
######################################################################

# Only run the tests when this file is executed as a script, so that
# importing the module does not require the corpus file below.
if __name__ == '__main__':
    FILE='/home/edloper/data/projects/nltk/data/brown/cp20'

    # Do random operations and make sure we get the right results.
    import sys, random
    test=[tagged2tuple(t) for t in open(FILE).read().split()]
    # Note: range(0) means the body of this loop never runs; raise
    # the count to enable the randomized test.
    for i in range(0):
        sys.stdout.write('.'); sys.stdout.flush()
        cv = brown_corpus_by_word(FILE)
        for j in range(100):
            if random.random() < .5: # Random access.
                k = random.randint(0, len(test)-1)
                assert cv[k] == test[k]
            if random.random() < .1: # Iterate
                for k, tok in enumerate(cv):
                    assert tok == test[k]
                    if random.random() < .02: break
            if random.random() < .02:
                assert list(cv) == test

    cv = brown_corpus_by_para(FILE)
    print cv[1][1][1], 'reasons' # second word of second sentence of second para
    cv = brown_corpus_by_sent(FILE)
    print cv[8][1], 'metal-tasting' # second word of ninth sentence
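
######################################################################
#{ Illustrative Sketch
######################################################################

# The block lookup described in the StreamBackedCorpusView docstring
# (find the mapping entry closest to, but not past, token index i) can
# be sketched in isolation as below.  This is only a minimal sketch
# under stated assumptions: the helper name `_find_block_start` and
# the sample mapping values are hypothetical, and are not part of the
# corpus view class itself.

import bisect

def _find_block_start(toknum, filepos, i):
    """
    Return the C{(token index, file position)} pair for the last
    mapped block that starts at or before token index C{i}.  Assumes
    that C{toknum} is sorted and begins with 0, and that C{i} is
    non-negative.
    """
    # bisect_right gives the insertion point after any entry equal to
    # i, so subtracting 1 selects the closest entry that is <= i.
    block = bisect.bisect_right(toknum, i) - 1
    return toknum[block], filepos[block]

# Hypothetical mapping: three blocks, starting at tokens 0, 3 and 7.
_example_toknum = [0, 3, 7]
_example_filepos = [0, 40, 95]
assert _find_block_start(_example_toknum, _example_filepos, 5) == (3, 40)
assert _find_block_start(_example_toknum, _example_filepos, 7) == (7, 95)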