org.basex.util.ft
Class FTLexer

java.lang.Object
  extended by org.basex.util.ft.FTIterator
      extended by org.basex.util.ft.FTLexer
All Implemented Interfaces:
java.util.Iterator<FTSpan>, IndexToken

public final class FTLexer
extends FTIterator
implements IndexToken

Performs full-text lexing on a token. To do so, it calls the tokenizers and stemmers that match the full-text options.

Author:
BaseX Team 2005-12, BSD License, Jens Erat

Constructor Summary
FTLexer()
          Constructor, using the default full-text options.
FTLexer(FTOpt opt)
          Constructor, using the specified full-text options.
 
Method Summary
 int count()
          Returns total number of tokens.
 FTOpt ftOpt()
          Returns the full-text options.
 byte[] get()
          Returns the original token.
 boolean hasNext()
           
 int[][] info()
          Gets full-text info for the specified token; needed for visualizations.
 void init()
          Initializes the iterator.
 FTLexer init(byte[] txt)
          Initializes the iterator.
static StringList languages()
          Lists all languages for which tokenizers and stemmers are available.
 FTSpan next()
           
 byte[] nextToken()
          Returns the next token.
 boolean paragraph()
          Checks if the current token is a paragraph. Does not have to be implemented by all tokenizers.
 int pos(int w, FTUnit u)
          Calculates a position value, dependent on the specified unit.
 FTLexer sc()
          Sets the special character flag.
 byte[] text()
          Returns the text to be processed.
 IndexType type()
          Returns the index type.
 
Methods inherited from class org.basex.util.ft.FTIterator
remove
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FTLexer

public FTLexer()
Constructor, using the default full-text options. Called by the XMLSerializer, FTFilter, and the map visualizations.


FTLexer

public FTLexer(FTOpt opt)
Constructor, using the specified full-text options.

Parameters:
opt - full-text options
Method Detail

sc

public FTLexer sc()
Sets the special character flag. Returns not only tokens, but also delimiters.

Returns:
self reference
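
The effect of the special character flag can be illustrated with a minimal sketch. ScSketchLexer below is a hypothetical stand-in written for this example, not the BaseX implementation: it works on String instead of byte[], and it assumes that every character that is neither a letter nor a digit counts as a delimiter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lexer with a special character flag; not the BaseX code. */
final class ScSketchLexer {
  private boolean special;  // if true, delimiters are returned as tokens too
  private String text = "";
  private int cursor;       // read offset into text

  /** Sets the special character flag and returns a self reference. */
  ScSketchLexer sc() {
    special = true;
    return this;
  }

  /** Initializes the lexer and returns a self reference. */
  ScSketchLexer init(final String txt) {
    text = txt;
    cursor = 0;
    return this;
  }

  /** Returns the next token, or null if the input is exhausted. */
  String nextToken() {
    while(cursor < text.length()) {
      final int start = cursor;
      if(Character.isLetterOrDigit(text.charAt(cursor))) {
        // consume a complete word
        while(cursor < text.length()
            && Character.isLetterOrDigit(text.charAt(cursor))) cursor++;
        return text.substring(start, cursor);
      }
      // single delimiter: returned only if the flag is set, skipped otherwise
      cursor++;
      if(special) return text.substring(start, cursor);
    }
    return null;
  }

  /** Collects all remaining tokens. */
  List<String> tokens() {
    final List<String> list = new ArrayList<>();
    for(String t; (t = nextToken()) != null;) list.add(t);
    return list;
  }
}
```

Because sc() and init() both return a self reference, the calls can be chained, as in new ScSketchLexer().sc().init("Hi, you!").tokens(). Without sc(), the sketch yields only the words; with it, the comma, the space and the exclamation mark are returned as tokens as well.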

init

public void init()
Initializes the iterator.


init

public FTLexer init(byte[] txt)
Description copied from class: FTIterator
Initializes the iterator.

Specified by:
init in class FTIterator
Parameters:
txt - text
Returns:
self reference

hasNext

public boolean hasNext()
Specified by:
hasNext in interface java.util.Iterator<FTSpan>

next

public FTSpan next()
Specified by:
next in interface java.util.Iterator<FTSpan>

nextToken

public byte[] nextToken()
Description copied from class: FTIterator
Returns the next token. May be called as an alternative to Iterator.next() to avoid the creation of new FTSpan instances.

Specified by:
nextToken in class FTIterator
Returns:
token
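
The difference between the two iteration styles can be sketched as follows. SketchLexer and SketchSpan are hypothetical stand-ins written for this example; the sketch assumes a plain space-separated input and does not reproduce the behaviour of the BaseX tokenizers.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for FTSpan: a token plus its word position. */
final class SketchSpan {
  final byte[] text;
  final int pos;

  SketchSpan(final byte[] text, final int pos) {
    this.text = text;
    this.pos = pos;
  }
}

/** Hypothetical space-separated lexer; not the BaseX implementation. */
final class SketchLexer implements Iterator<SketchSpan> {
  private byte[] text = {};
  private int cursor;  // read offset into text
  private int pos;     // number of tokens returned so far

  /** Initializes the iterator and returns a self reference. */
  SketchLexer init(final byte[] txt) {
    text = txt;
    cursor = 0;
    pos = 0;
    return this;
  }

  @Override
  public boolean hasNext() {
    // skip delimiters, then check whether any input remains
    while(cursor < text.length && text[cursor] == ' ') cursor++;
    return cursor < text.length;
  }

  /** Returns the next token, wrapped in a newly created span. */
  @Override
  public SketchSpan next() {
    final byte[] token = nextToken();
    return new SketchSpan(token, pos - 1);
  }

  /** Returns the next token without creating a span object. */
  byte[] nextToken() {
    if(!hasNext()) throw new NoSuchElementException();
    final int start = cursor;
    while(cursor < text.length && text[cursor] != ' ') cursor++;
    pos++;
    return Arrays.copyOfRange(text, start, cursor);
  }
}
```

A caller that only needs the token bytes can loop with while(lexer.hasNext()) lexer.nextToken(); and skip the per-token span allocation. A caller that also needs the position uses next() and reads the fields of the returned span.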

count

public int count()
Returns total number of tokens.

Returns:
token count

type

public IndexType type()
Description copied from interface: IndexToken
Returns the index type.

Specified by:
type in interface IndexToken
Returns:
type

get

public byte[] get()
Returns the original token. Inherited from IndexToken; use next() or nextToken() if not using this interface.

Specified by:
get in interface IndexToken
Returns:
current token

ftOpt

public FTOpt ftOpt()
Returns the full-text options. Can be null.

Returns:
full-text options

text

public byte[] text()
Returns the text to be processed.

Returns:
text

paragraph

public boolean paragraph()
Checks if the current token is a paragraph. Does not have to be implemented by all tokenizers. Returns false if not implemented.

Returns:
true for a paragraph; false otherwise, or if not implemented

pos

public int pos(int w,
               FTUnit u)
Calculates a position value, dependent on the specified unit. Does not have to be implemented by all tokenizers. Returns 0 if not implemented.

Parameters:
w - word position
u - unit
Returns:
new position
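
One way to read "dependent on the specified unit" is that a word position is mapped to the index of the enclosing unit. The sketch below illustrates that reading only. SketchUnit and PosSketch are hypothetical names, the real FTUnit may define other values, and the sentence rule used here (a sentence ends at '.', '!' or '?') is an assumption made for the example.

```java
/** Hypothetical units; the real FTUnit may define others. */
enum SketchUnit { WORD, SENTENCE }

/** Hypothetical position calculator; not the BaseX implementation. */
final class PosSketch {
  private final String text;

  PosSketch(final String text) {
    this.text = text;
  }

  /**
   * Maps the zero-based word position w to a position in unit u.
   * Returns w itself for words, and the zero-based index of the
   * enclosing sentence otherwise. If w lies beyond the last word,
   * the number of sentence delimiters in the text is returned.
   */
  int pos(final int w, final SketchUnit u) {
    if(u == SketchUnit.WORD) return w;
    int word = 0, sentence = 0;
    boolean inWord = false;
    for(int i = 0; i < text.length(); i++) {
      final char c = text.charAt(i);
      if(Character.isLetterOrDigit(c)) {
        if(!inWord) {
          // a new word starts here
          if(word == w) return sentence;
          word++;
          inWord = true;
        }
      } else {
        inWord = false;
        // assumed sentence delimiters
        if(c == '.' || c == '!' || c == '?') sentence++;
      }
    }
    return sentence;
  }
}
```

For the text "One two. Three four five. Six.", the word at position 2 ("Three") lies in the second sentence, so pos(2, SketchUnit.SENTENCE) returns 1, while pos(2, SketchUnit.WORD) returns 2 unchanged.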

info

public int[][] info()
Gets full-text info for the specified token; needed for visualizations. See Tokenizer.info() for more info.

Returns:
int arrays or empty array if not implemented

languages

public static StringList languages()
Lists all languages for which tokenizers and stemmers are available.

Returns:
supported languages