neurox.data.extraction
Submodules:
neurox.data.extraction.transformers_extractor
Representations Extractor for transformers toolkit models.
Module that given a file with input sentences and a transformers
model, extracts representations from all layers of the model. The script
supports aggregation over sub-words created due to the tokenization of
the provided model.
- Can also be invoked as a script as follows:
python -m neurox.data.extraction.transformers_extractor
- neurox.data.extraction.transformers_extractor.get_model_and_tokenizer(model_desc, device='cpu', random_weights=False)
Automatically get the appropriate transformers model and tokenizer based on the model description.
- Parameters:
model_desc (str) – Model description; can be a model name like bert-base-uncased, a comma-separated pair <model>,<tokenizer> (since 1.0.8), or a path to a trained model.
device (str, optional) – Device to load the model on, cpu or gpu. Defaults to cpu.
random_weights (bool, optional) – Whether the weights of the model should be randomized. Useful for analyses where one needs an untrained model.
- Returns:
model (transformers model) – An instance of one of the transformers.modeling classes
tokenizer (transformers tokenizer) – An instance of one of the transformers.tokenization classes
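The accepted forms of model_desc can be illustrated with a small sketch. The helper split_model_desc below is hypothetical (it is not part of NeuroX) and only assumes that a single comma separates the model from the tokenizer; the library's own parsing may differ.

```python
def split_model_desc(model_desc):
    """Hypothetical helper: split a description into (model, tokenizer) names.

    Assumption: one comma separates <model> from <tokenizer>; without a comma,
    the same name or path is used for both.
    """
    parts = [part.strip() for part in model_desc.split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(
        f"Expected '<model>' or '<model>,<tokenizer>', got {model_desc!r}"
    )


# A plain model name (or a path) serves as both model and tokenizer.
print(split_model_desc("bert-base-uncased"))
# A comma-separated pair names the model first and the tokenizer second.
print(split_model_desc("path/to/trained_model,bert-base-uncased"))
```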
- neurox.data.extraction.transformers_extractor.aggregate_repr(state, start, end, aggregation)
Function that aggregates activations/embeddings over a span of subword tokens. This function will usually be called once per word. For example, if we had the sentence:
This is an example
which is tokenized by BPE into:
this is an ex @@am @@ple
The function should be called 4 times:
aggregate_repr(state, 0, 0, aggregation)
aggregate_repr(state, 1, 1, aggregation)
aggregate_repr(state, 2, 2, aggregation)
aggregate_repr(state, 3, 5, aggregation)
Returns a zero vector if end is less than start, i.e. the request is to aggregate over an empty slice.
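The start and end indices in the calls above follow from the subword boundaries. A minimal sketch of how such spans could be derived is shown here, assuming the BPE convention from the example in which a continuation subword begins with @@. The function word_spans is hypothetical and is not the extractor's own detokenization procedure.

```python
def word_spans(subword_tokens, continuation_prefix="@@"):
    """Return one (start, end) index pair per word, with end inclusive.

    Assumption: a subword starting with `continuation_prefix` belongs to the
    same word as the subword before it.
    """
    spans = []
    for idx, token in enumerate(subword_tokens):
        if token.startswith(continuation_prefix) and spans:
            # Extend the current word to cover this continuation subword.
            start, _ = spans[-1]
            spans[-1] = (start, idx)
        else:
            # A new word starts (and, so far, ends) at this subword.
            spans.append((idx, idx))
    return spans


tokens = "this is an ex @@am @@ple".split(" ")
for start, end in word_spans(tokens):
    print(f"aggregate_repr(state, {start}, {end}, aggregation)")
```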
- Parameters:
state (numpy.ndarray) – Matrix of size [NUM_LAYERS x NUM_SUBWORD_TOKENS_IN_SENT x LAYER_DIM]
start (int) – Index of the first subword of the word being processed
end (int) – Index of the last subword of the word being processed
aggregation ({'first', 'last', 'average'}) – Aggregation method for combining subword activations
- Returns:
word_vector – Matrix of size [NUM_LAYERS x LAYER_DIM]
- Return type:
numpy.ndarray
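The documented behaviour of the three aggregation methods can be sketched with NumPy as follows. This is an illustrative re-implementation under the shapes stated above, not the library's source; the name aggregate_span_sketch is hypothetical.

```python
import numpy as np


def aggregate_span_sketch(state, start, end, aggregation):
    """Combine subword activations state[:, start..end, :] into one word vector.

    state has shape [NUM_LAYERS x NUM_SUBWORD_TOKENS_IN_SENT x LAYER_DIM];
    the result has shape [NUM_LAYERS x LAYER_DIM].
    """
    if end < start:
        # Empty slice: return a zero vector per layer, as documented.
        return np.zeros((state.shape[0], state.shape[2]), dtype=state.dtype)
    if aggregation == "first":
        return state[:, start, :]
    if aggregation == "last":
        return state[:, end, :]
    if aggregation == "average":
        # `end` is inclusive, so the slice runs to end + 1.
        return np.mean(state[:, start : end + 1, :], axis=1)
    raise ValueError(f"Unknown aggregation method: {aggregation!r}")


# Toy activations: 2 layers, 6 subword tokens, 3 neurons per layer.
state = np.arange(2 * 6 * 3, dtype="float32").reshape(2, 6, 3)
word_vector = aggregate_span_sketch(state, 3, 5, "average")
print(word_vector.shape)
```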
- neurox.data.extraction.transformers_extractor.extract_sentence_representations(sentence, model, tokenizer, device='cpu', include_embeddings=True, aggregation='last', dtype='float32', include_special_tokens=False, tokenization_counts={})
Get representations for a single sentence.
The extractor runs a detokenization procedure to combine subwords automatically. For instance, the sentence “Hello, how are you?” may be tokenized by the model as “Hell @@o , how are you @@?”. The extractor automatically merges these subwords back into the original tokens.
- Parameters:
sentence (str) – Sentence for which the extraction needs to be done. The returned output will have representations for exactly the same number of elements as tokens in this sentence (counted by sentence.split(' ')).
model (transformers model) – An instance of one of the transformers.modeling classes
tokenizer (transformers tokenizer) – An instance of one of the transformers.tokenization classes
device (str, optional) – Specifies the device (CPU/GPU) on which the extraction should be performed. Defaults to ‘cpu’
include_embeddings (bool, optional) – Whether the embedding layer should be included in the final output, or just regular layers. Defaults to True
aggregation ({'first', 'last', 'average'}, optional) – Aggregation method for combining subword activations. Defaults to ‘last’
dtype (str, optional) – Data type in which the activations will be stored. Supports all numpy based tensor types. Common values are ‘float32’ and ‘float16’. Defaults to ‘float32’
include_special_tokens (bool, optional) – Whether or not to include special tokens in the extracted representations. Special tokens are tokens not present in the original sentence, but are added by the tokenizer, such as [CLS], [SEP] etc. Defaults to False
tokenization_counts (dict, optional) – Tokenization counts to use across a dataset for efficiency
- Returns:
final_hidden_states (numpy.ndarray) – Numpy matrix of size [NUM_LAYERS x NUM_TOKENS x NUM_NEURONS].
detokenizer (list) – List of detokenized words. This will have the same number of elements as tokens in the original sentence, plus special tokens if requested. Each element preserves tokenization artifacts (such as ##, @@ etc) to enable further automatic processing.
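To make the output shape and the dtype option concrete, the sketch below builds a placeholder array with the documented layout and indexes it by layer and word. The sizes (13 layers, 768 neurons) are assumptions chosen for illustration, and the array is filled with zeros rather than real activations.

```python
import numpy as np

sentence = "Hello, how are you?"
words = sentence.split(" ")  # the extractor returns one vector per such token

# Assumed sizes for illustration only; real values depend on the model.
num_layers, num_neurons = 13, 768
hidden_states = np.zeros((num_layers, len(words), num_neurons), dtype="float32")

# Vector of the word "how" in the last layer: index as [layer, token, neuron].
word_index = words.index("how")
last_layer_vector = hidden_states[-1, word_index, :]

# Storing activations as float16 instead of float32 halves the memory needed.
half_precision = hidden_states.astype("float16")
print(hidden_states.shape, hidden_states.nbytes, half_precision.nbytes)
```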
- neurox.data.extraction.transformers_extractor.extract_representations(model_desc, input_corpus, output_file, device='cpu', aggregation='last', output_type='json', random_weights=False, ignore_embeddings=False, decompose_layers=False, filter_layers=None, dtype='float32', include_special_tokens=False)
Extract representations for an entire corpus and save them to disk
- Parameters:
model_desc (str) – Model description; can be a model name like bert-base-uncased, a comma-separated pair <model>,<tokenizer> (since 1.0.8), or a path to a trained model.
input_corpus (str) – Path to the input corpus, where each sentence is on its own line
output_file (str) – Path to output file. Supports all filetypes supported by data.writer.ActivationsWriter.
device (str, optional) – Specifies the device (CPU/GPU) on which the extraction should be performed. Defaults to ‘cpu’
aggregation ({'first', 'last', 'average'}, optional) – Aggregation method for combining subword activations. Defaults to ‘last’
output_type (str, optional) – Explicit definition of output file type if it cannot be derived from the output_file path
random_weights (bool, optional) – Whether the weights of the model should be randomized. Useful for analyses where one needs an untrained model. Defaults to False.
ignore_embeddings (bool, optional) – Whether the embedding layer should be excluded in the final output, or kept with the regular layers. Defaults to False
decompose_layers (bool, optional) – Whether each layer should have its own output file, or all layers be saved in a single file. Defaults to False, i.e. single file
filter_layers (str, optional) – Comma separated list of layer indices to save. The format is the same as the one accepted by data.writer.ActivationsWriter. Defaults to None
dtype (str, optional) – Data type in which the activations will be stored. Supports all numpy based tensor types. Common values are ‘float32’ and ‘float16’. Defaults to ‘float32’
include_special_tokens (bool, optional) – Whether or not to include special tokens in the extracted representations. Special tokens are tokens not present in the original sentence, but are added by the tokenizer, such as [CLS], [SEP] etc. Defaults to False
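A corpus-level extraction could be set up roughly as follows. The sketch writes an input corpus with one sentence per line and wraps the call in a function, using only the parameters listed in the signature above. The helper names write_corpus and run_extraction, the file names, and the chosen model are assumptions for illustration; the extraction itself needs NeuroX and transformers installed and may download the model.

```python
import tempfile
from pathlib import Path


def write_corpus(path, sentences):
    """Write one sentence per line, the input format expected for input_corpus."""
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")


def run_extraction(corpus_path, output_path):
    """Hypothetical wrapper around extract_representations."""
    # Imported lazily so this sketch can be loaded without NeuroX installed.
    from neurox.data.extraction.transformers_extractor import (
        extract_representations,
    )

    extract_representations(
        "bert-base-uncased",    # model_desc: assumed model name
        str(corpus_path),       # input_corpus
        str(output_path),       # output_file
        device="cpu",
        aggregation="average",  # one of 'first', 'last', 'average'
        output_type="json",
        dtype="float32",
    )


corpus_path = Path(tempfile.gettempdir()) / "neurox_example_corpus.txt"
write_corpus(corpus_path, ["This is an example", "Hello, how are you?"])

# Uncomment to run the actual extraction:
# run_extraction(corpus_path, Path(tempfile.gettempdir()) / "activations.json")
```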
Module contents: