C++ SDK Documentation  9.0
Vertica::UDChunker Class Referenceabstract

Separates parser input into chunks at record boundaries, allowing parsing to proceed in parallel. More...

Collaboration diagram for Vertica::UDChunker:
Collaboration graph

Public Member Functions

virtual ~UDChunker ()
 
virtual StreamState alignPortion (ServerInterface &srvInterface, DataBuffer &input, InputState inputState)
 
virtual void destroy (ServerInterface &srvInterface, SizedColumnTypes &returnType)
 
virtual void destroy (ServerInterface &srvInterface, SizedColumnTypes &returnType, SessionParamWriterMap &udSessionParams)
 
virtual StreamState process (ServerInterface &srvInterface, DataBuffer &input, InputState input_state)=0
 
virtual void setup (ServerInterface &srvInterface, SizedColumnTypes &returnType)
 

Protected Member Functions

Portion getPortion () const
 

Detailed Description

Separates parser input into chunks at record boundaries, allowing parsing to proceed in parallel.

Vertica invokes a UDChunker (if present) as part of the parsing phase. Chunkers are tightly coupled to parsers.

Constructor & Destructor Documentation

virtual Vertica::UDChunker::~UDChunker ( )
inlinevirtual

Dtor.

Member Function Documentation

virtual StreamState Vertica::UDChunker::alignPortion ( ServerInterface srvInterface,
DataBuffer input,
InputState  inputState 
)
inlinevirtual

UDChunker::alignPortion()

See "Apportioned Load".

Reads data from the input stream and aligns the chunker to the beginning of the first chunk in the portion of the stream it is processing.

This is invoked repeatedly during query execution, before any calls to process(), if we are doing an apportioned load. The input buffer initially contains contents from the start of the current portion; future calls may advance along the stream. This method should identify the start of the first chunk that this UDChunker will produce, and indicate that it is now "aligned" to that boundary by setting input.offset to the corresponding offset and returning DONE.

Note that input may contain null bytes, if the source contains null bytes. Note also that input is NOT automatically null-terminated.

alignPortion() must not block indefinitely. If it cannot proceed for an extended period of time, it should return KEEP_GOING. It will be called again shortly. Failure to do this will, among other things, prevent the query from being canceled by the user.

Valid return values for this method are INPUT_NEEDED, KEEP_GOING, and DONE.

Returns
KEEP_GOING if the UDChunker must block; INPUT_NEEDED if a record boundary cannot be found in the input. Leaving input.offset unchanged will result in a larger buffer for the next invocation. Advance input.offset to advance through the data stream. DONE if a portion boundary has been identified. Set input.offset to the location of that boundary.
virtual void Vertica::UDChunker::destroy ( ServerInterface srvInterface,
SizedColumnTypes returnType 
)
inlinevirtual

UDChunker::destroy()

Will be invoked during query execution, after the last time that process() is called on this UDChunker instance for a particular input.

May optionally be overridden to perform tear-down/destruction.

See UDChunker::setup() for a note about the restartability of UDChunkers.

Portion Vertica::UDChunker::getPortion ( ) const
inlineprotected

Return a description of the portion of the input source which is being processed by the active load stack. This can be called by the UDChunker implementation to determine how much data to process.

This can be invoked during setup(), process() or destroy().

The Portion object returned will be a valid portion if any part of the load stack is processing portions. This UDChunker instance will see input states START_OF_PORTION (in alignPortion()) and/or END_OF_PORTION (in either alignPortion() or process()) if it is the stack element directly processing portions.

virtual StreamState Vertica::UDChunker::process ( ServerInterface srvInterface,
DataBuffer input,
InputState  input_state 
)
pure virtual

UDChunker::process()

Will be invoked repeatedly during query execution, until it returns DONE or until the query is canceled by the user.

On each invocation, process() will be given an input buffer. It should read data from that buffer, find record boundaries and align input.offset with the end of the last record in the buffer. Once it has processed as much as it reasonably can (for example, once it has processed the last complete row in the input buffer), it should return INPUT_NEEDED to indicate that more data is needed, or DONE to indicate that it has completed "parsing" this input stream and will not be reading more bytes from it.

if a few rows were found in current block, move offset forward to point at the start of next (potential) row, and mark state as CHUNK_ALIGNED, indicating the chunker is ready to hand this chunk to parser.

If input_state == END_OF_FILE, then the last byte in input is the last byte in the input stream. Returning INPUT_NEEDED will not result in any new input appearing. process() should return DONE in this case.

If input_state == END_OF_PORTION, then the last byte in input is the last byte in the current portion. Returning CHUNK_ALIGNED or INPUT_NEEDED will result in retrieval of data beyond the portion boundary. If a chunker which supports apportioned load sees END_OF_PORTION, it should (a) align a chunk as usual, and return CHUNK_ALIGNED or INPUT_NEEDED to retrieve the next block, and then (b) align a chunk to the FIRST record boundary which appears in the next portion.

Note that input may contain null bytes, if the source file contains null bytes. Note also that input is NOT automatically null-terminated.

process() must not block indefinitely. If it cannot proceed for an extended period of time, it should return KEEP_GOING. It will be called again shortly. Failure to do this will, among other things, prevent the query from being canceled by the user.

Returns
INPUT_NEEDED if this UDChunker has more data to process; DONE if has no more data to produce; CHUNK_ALIGNED if this UDChunker successfully aligned some rows and is ready to give them to parser.

Note that it is UNSAFE to maintain pointers or references to any of these arguments (or any other argument passed by reference into any other function in this API) beyond the scope of the function call in question. For example, do not store a reference to the server interface or the input block on an instance variable. Vertica may free and replace these objects.

virtual void Vertica::UDChunker::setup ( ServerInterface srvInterface,
SizedColumnTypes returnType 
)
inlinevirtual

UDChunker::setup()

Will be invoked during query execution, prior to the first time that process() is called on this UDChunker instance for a particular input source.

May optionally be overridden to perform setup/initialzation.

Note that UDChunkers MUST BE RESTARTABLE! If loading large numbers of files, a given UDChunker may be re-used for multiple files. Vertica follows the worker-pool design pattern: At the start of COPY execution, several Chunkers and several Filters are instantiated per node, by calling the corresponding prepare() method multiple times. Each Filter/Chunker pair is then internally assigned to an initial Source (UDSource or internal). At that point, setup() is called; then process() is called until it is finished; then destroy() is called. If there are still sources in the pool waiting to be processed, then the UDFilter/UDChunker pair will be given a second Source; setup() will be called a second time, then process() until it is finished, then destroy(). This repeats until all sources have been read.