neurox.data.extraction

Submodules:

neurox.data.extraction.transformers_extractor

Representations Extractor for transformers toolkit models.

Module that given a file with input sentences and a transformers model, extracts representations from all layers of the model. The script supports aggregation over sub-words created due to the tokenization of the provided model.

Can also be invoked as a script as follows:

python -m neurox.data.extraction.transformers_extractor

neurox.data.extraction.transformers_extractor.get_model_and_tokenizer(model_desc, device='cpu', random_weights=False)

Automatically get the appropriate transformers model and tokenizer based on the model description

Parameters:
  • model_desc (str) – Model description; can either be a model name like bert-base-uncased, a comma separated list indicating <model>,<tokenizer> (since 1.0.8), or a path to a trained model

  • device (str, optional) – Device to load the model on, cpu or gpu. Default is cpu.

  • random_weights (bool, optional) – Whether the weights of the model should be randomized. Useful for analyses where one needs an untrained model.

Returns:

  • model (transformers model) – An instance of one of the transformers.modeling classes

  • tokenizer (transformers tokenizer) – An instance of one of the transformers.tokenization classes
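The `<model>,<tokenizer>` form of model_desc can be illustrated with a minimal sketch. The helper name `split_model_desc` is hypothetical and not part of the library; it only shows one plausible way such a description could be split, assuming that a single name is used for both the model and the tokenizer.

```python
def split_model_desc(model_desc):
    """Hypothetical helper: split a description into (model, tokenizer) names.

    Assumption: a description without a comma names both the model and
    its tokenizer; a description with one comma is '<model>,<tokenizer>'.
    """
    parts = [part.strip() for part in model_desc.split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"Expected '<model>' or '<model>,<tokenizer>', got: {model_desc!r}")


# A plain model name is used for both roles.
same = split_model_desc("bert-base-uncased")

# A pair: 'my-model-dir' is an illustrative placeholder path.
pair = split_model_desc("my-model-dir,bert-base-uncased")
```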

neurox.data.extraction.transformers_extractor.aggregate_repr(state, start, end, aggregation)

Function that aggregates activations/embeddings over a span of subword tokens. This function will usually be called once per word. For example, if we had the sentence:

This is an example

which is tokenized by BPE into:

this is an ex @@am @@ple

The function should be called 4 times:

aggregate_repr(state, 0, 0, aggregation)
aggregate_repr(state, 1, 1, aggregation)
aggregate_repr(state, 2, 2, aggregation)
aggregate_repr(state, 3, 5, aggregation)

Returns a zero vector if end is less than start, i.e. the request is to aggregate over an empty slice.

Parameters:
  • state (numpy.ndarray) – Matrix of size [ NUM_LAYERS x NUM_SUBWORD_TOKENS_IN_SENT x LAYER_DIM]

  • start (int) – Index of the first subword of the word being processed

  • end (int) – Index of the last subword of the word being processed

  • aggregation ({'first', 'last', 'average'}) – Aggregation method for combining subword activations

Returns:

word_vector – Matrix of size [NUM_LAYERS x LAYER_DIM]

Return type:

numpy.ndarray
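The behaviour described above can be sketched with NumPy. This is an illustrative re-implementation under the documented contract, not the library's own code; the name `aggregate_span` is hypothetical, and it assumes `end` is inclusive, as in the four calls listed earlier.

```python
import numpy as np


def aggregate_span(state, start, end, aggregation):
    """Sketch: aggregate subword activations over the inclusive span [start, end].

    state has shape [NUM_LAYERS x NUM_SUBWORD_TOKENS_IN_SENT x LAYER_DIM].
    """
    if end < start:
        # Empty slice: return a zero vector per layer.
        return np.zeros((state.shape[0], state.shape[2]), dtype=state.dtype)
    if aggregation == "first":
        return state[:, start, :]
    if aggregation == "last":
        return state[:, end, :]
    if aggregation == "average":
        return state[:, start:end + 1, :].mean(axis=1)
    raise ValueError(f"Unknown aggregation: {aggregation!r}")


# Toy activations: 2 layers, 6 subword tokens, 3 neurons per layer.
state = np.arange(2 * 6 * 3, dtype="float32").reshape(2, 6, 3)

# One span per word of "this is an ex @@am @@ple".
spans = [(0, 0), (1, 1), (2, 2), (3, 5)]
word_vectors = [aggregate_span(state, s, e, "average") for s, e in spans]
```

Each element of `word_vectors` has shape [NUM_LAYERS x LAYER_DIM], so stacking them along a new middle axis gives one vector per word and layer.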

neurox.data.extraction.transformers_extractor.extract_sentence_representations(sentence, model, tokenizer, device='cpu', include_embeddings=True, aggregation='last', dtype='float32', include_special_tokens=False, tokenization_counts={})

Get representations for a single sentence

The extractor runs a detokenization procedure to combine subwords automatically. For instance, the sentence “Hello, how are you?” may be tokenized by the model as “Hell @@o , how are you @@?”. The extractor merges these subword tokens back into the original tokens, so the output has one representation per original token.

Parameters:
  • sentence (str) – Sentence for which the extraction needs to be done. The returned output will have representations for exactly the same number of elements as tokens in this sentence (counted by sentence.split(' ')).

  • model (transformers model) – An instance of one of the transformers.modeling classes

  • tokenizer (transformers tokenizer) – An instance of one of the transformers.tokenization classes

  • device (str, optional) – Specifies the device (CPU/GPU) on which the extraction should be performed. Defaults to ‘cpu’

  • include_embeddings (bool, optional) – Whether the embedding layer should be included in the final output, or just regular layers. Defaults to True

  • aggregation ({'first', 'last', 'average'}, optional) – Aggregation method for combining subword activations. Defaults to ‘last’

  • dtype (str, optional) – Data type in which the activations will be stored. Supports all numpy based tensor types. Common values are ‘float32’ and ‘float16’. Defaults to ‘float32’

  • include_special_tokens (bool, optional) – Whether or not to include special tokens in the extracted representations. Special tokens are tokens not present in the original sentence, but are added by the tokenizer, such as [CLS], [SEP] etc. Defaults to False

  • tokenization_counts (dict, optional) – Tokenization counts to use across a dataset for efficiency

Returns:

  • final_hidden_states (numpy.ndarray) – Numpy Matrix of size [NUM_LAYERS x NUM_TOKENS x NUM_NEURONS].

  • detokenizer (list) – List of detokenized words. This will have the same number of elements as tokens in the original sentence, plus special tokens if requested. Each element preserves tokenization artifacts (such as ##, @@ etc) to enable further automatic processing.
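The relationship between subword-level states and the returned [NUM_LAYERS x NUM_TOKENS x NUM_NEURONS] matrix can be sketched as follows. This is a toy illustration, not the extractor's actual procedure: `TOY_SUBWORDS` and `word_spans` are hypothetical stand-ins for a real tokenizer, and the random array stands in for model activations.

```python
import numpy as np

# Assumption: a toy subword vocabulary standing in for a real tokenizer.
TOY_SUBWORDS = {
    "this": ["this"],
    "is": ["is"],
    "an": ["an"],
    "example": ["ex", "@@am", "@@ple"],
}


def word_spans(sentence):
    """Return inclusive (start, end) subword spans, one per space-separated word."""
    spans = []
    cursor = 0
    for word in sentence.split(" "):
        pieces = TOY_SUBWORDS[word.lower()]
        spans.append((cursor, cursor + len(pieces) - 1))
        cursor += len(pieces)
    return spans, cursor


sentence = "This is an example"
spans, num_subwords = word_spans(sentence)

# Stand-in activations: 3 layers, one row per subword, 8 neurons per layer.
rng = np.random.default_rng(0)
subword_states = rng.normal(size=(3, num_subwords, 8)).astype("float32")

# 'last' aggregation: keep the state of the final subword of each word.
word_states = np.stack([subword_states[:, end, :] for _, end in spans], axis=1)
```

Here `word_states` has one row per word in `sentence.split(' ')`, which is the shape contract stated for `final_hidden_states` when special tokens are not included.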

neurox.data.extraction.transformers_extractor.extract_representations(model_desc, input_corpus, output_file, device='cpu', aggregation='last', output_type='json', random_weights=False, ignore_embeddings=False, decompose_layers=False, filter_layers=None, dtype='float32', include_special_tokens=False)

Extract representations for an entire corpus and save them to disk

Parameters:
  • model_desc (str) – Model description; can either be a model name like bert-base-uncased, a comma separated list indicating <model>,<tokenizer> (since 1.0.8), or a path to a trained model

  • input_corpus (str) – Path to the input corpus, where each sentence is on its separate line

  • output_file (str) – Path to output file. Supports all filetypes supported by data.writer.ActivationsWriter.

  • device (str, optional) – Specifies the device (CPU/GPU) on which the extraction should be performed. Defaults to ‘cpu’

  • aggregation ({'first', 'last', 'average'}, optional) – Aggregation method for combining subword activations. Defaults to ‘last’

  • output_type (str, optional) – Explicit definition of output file type if it cannot be derived from the output_file path. Defaults to ‘json’

  • random_weights (bool, optional) – Whether the weights of the model should be randomized. Useful for analyses where one needs an untrained model. Defaults to False.

  • ignore_embeddings (bool, optional) – Whether the embedding layer should be excluded in the final output, or kept with the regular layers. Defaults to False

  • decompose_layers (bool, optional) – Whether each layer should have its own output file, or all layers be saved in a single file. Defaults to False, i.e. single file

  • filter_layers (str, optional) – Comma separated list of layer indices to save. The format is the same as the one accepted by data.writer.ActivationsWriter. Defaults to None

  • dtype (str, optional) – Data type in which the activations will be stored. Supports all numpy based tensor types. Common values are ‘float32’ and ‘float16’. Defaults to ‘float32’

  • include_special_tokens (bool, optional) – Whether or not to include special tokens in the extracted representations. Special tokens are tokens not present in the original sentence, but are added by the tokenizer, such as [CLS], [SEP] etc. Defaults to False
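The storage trade-off behind the dtype parameter can be illustrated with plain NumPy. This sketch does not call the extractor; the array is random stand-in data with arbitrary sizes, and it only shows how the two common dtype values differ in size and precision.

```python
import numpy as np

# Stand-in activations with arbitrary sizes: 2 layers, 5 tokens, 16 neurons.
rng = np.random.default_rng(0)
activations = rng.normal(size=(2, 5, 16))

as_f32 = activations.astype("float32")
as_f16 = activations.astype("float16")

# float16 uses 2 bytes per value, float32 uses 4.
bytes_f32 = as_f32.nbytes
bytes_f16 = as_f16.nbytes

# Largest rounding error introduced by storing in half precision.
max_err = float(np.abs(as_f32 - as_f16.astype("float32")).max())
```

Half precision halves the storage at the cost of a small rounding error per value, which may or may not matter depending on the downstream analysis.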

neurox.data.extraction.transformers_extractor.main()

Module contents: