neurox.data

neurox.data#

Subpackages:

neurox.data.extraction
- neurox.data.extraction.transformers_extractor

Submodules:

neurox.data.annotate#

Given a list of sentences, their activations and a pattern, create a binary labeled dataset based on the pattern, where the pattern can be a regular expression, a list of words or a function. For example, one can create a binary dataset of years vs. not-years (2004 vs. this) by specifying a regular expression that matches years. The program extracts positive class examples based on the provided filter and treats the rest of the examples as negative class examples. The output of the program is a word file, a label file and an activation file.

neurox.data.annotate.annotate_data(source_path, activations_path, binary_filter, output_prefix, output_type='hdf5', decompose_layers=False, filter_layers=None)[source]#

Given a set of sentences, per-word activations, a binary_filter and an output_prefix, creates binary data and saves it to disk. A binary filter can be a set of words, a regex object or a function.

Parameters:

source_path (str) – Path to a text file with one sentence per line
activations_path (str) – Path to the file containing the sentence-wise activations
binary_filter (set, regex object or function) – Filter that selects the positive class: a set of words, a regex object or a function
output_prefix (str) – Prefix of the output files that this function saves

Return type:

Saves a word file, a binary label file and their activations

Example

annotate_data(source_path, activations_path, re.compile(r'^\w\w$'), output_prefix) selects words of exactly two characters as the positive class.

annotate_data(source_path, activations_path, {'is', 'can'}, output_prefix) selects occurrences of 'is' and 'can' as the positive class.
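The three filter types described above can be illustrated with a minimal sketch. The helpers `is_positive` and `make_binary_labels` are hypothetical and not part of NeuroX; they only show how a set, a regex object or a function could each decide the positive class. Matching the regex from the start of the word with `match` is an assumption, as are the label strings.

```python
import re


def is_positive(word, binary_filter):
    """Decide whether a word belongs to the positive class (hypothetical helper)."""
    if isinstance(binary_filter, (set, frozenset)):
        return word in binary_filter
    if isinstance(binary_filter, re.Pattern):
        # Assumption: the pattern is matched from the start of the word.
        return binary_filter.match(word) is not None
    if callable(binary_filter):
        return bool(binary_filter(word))
    raise TypeError("binary_filter must be a set, a regex object or a function")


def make_binary_labels(words, binary_filter):
    """Return one 'positive'/'negative' label per word."""
    return [
        "positive" if is_positive(word, binary_filter) else "negative"
        for word in words
    ]


words = "She said it is 2004 now".split()
year_labels = make_binary_labels(words, re.compile(r"^\d{4}$"))
set_labels = make_binary_labels(words, {"is", "it"})
func_labels = make_binary_labels(words, lambda w: w[0].isupper())
```

Here only `2004` is positive under the year regex, `it` and `is` are positive under the set filter, and `She` is positive under the capitalization function.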

neurox.data.control_task#

neurox.data.control_task.create_sequence_labeling_dataset(train_tokens, dev_source=None, test_source=None, case_sensitive=True, sample_from='same')[source]#

Method that prepares labels for a control task, as defined in §2.1 of Hewitt and Liang (2019) <https://aclanthology.org/D19-1275.pdf>

Target classes are selected randomly for each token type in the datasets. The number of control task classes is the same as the number of classes in train_tokens['target']. The distribution of control task labels can be specified.

Parameters:

train_tokens (dict) – Dictionary containing two lists of lists, source and target, representing the training set, as produced by the data loader.
dev_source (list, optional) – List containing the source tokens from the development set, as produced by dev_tokens['source']
test_source (list, optional) – List containing the source tokens from the test set, as produced by test_tokens['source']
case_sensitive (bool, optional) – defaults to True. Sets whether the token comparison (for assigning the control task labels) is case-sensitive or case-insensitive.
sample_from (str, optional) – defaults to ‘same’. The distribution from which control task labels are sampled. ‘same’: Labels are sampled from the same distribution as the main task labels. ‘uniform’: Labels are sampled from a uniform distribution.

Returns:

control_task_tokens – A list with either one, two or three elements - depending on whether control task labels for only the train, or also dev and test set should be created. Each element of the list is a dictionary containing two lists, source and target. The source list is the same as from the tokens input. The target list is the list of control task labels.

Return type:

list
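The idea behind the control task can be sketched as follows: every token type receives one randomly drawn label, and all occurrences of that type share it. This is a simplified illustration for the training set only, not the library's implementation. The function name `assign_control_labels` and the `seed` parameter are assumptions made for this example.

```python
import random
from collections import Counter


def assign_control_labels(train_tokens, sample_from="same", case_sensitive=True, seed=0):
    """Assign one random label per token type (hypothetical sketch)."""
    rng = random.Random(seed)  # 'seed' is an assumption for reproducibility

    # Label inventory and frequencies of the main task.
    label_counts = Counter(
        label for sentence in train_tokens["target"] for label in sentence
    )
    classes = sorted(label_counts)
    if sample_from == "same":
        weights = [label_counts[c] for c in classes]  # empirical distribution
    elif sample_from == "uniform":
        weights = [1] * len(classes)
    else:
        raise ValueError("sample_from must be 'same' or 'uniform'")

    type_to_label = {}
    control_targets = []
    for sentence in train_tokens["source"]:
        labels = []
        for token in sentence:
            key = token if case_sensitive else token.lower()
            if key not in type_to_label:
                type_to_label[key] = rng.choices(classes, weights=weights, k=1)[0]
            labels.append(type_to_label[key])
        control_targets.append(labels)
    return {"source": train_tokens["source"], "target": control_targets}


train_tokens = {
    "source": [["the", "cat", "sat"], ["The", "dog", "sat"]],
    "target": [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"]],
}
control = assign_control_labels(train_tokens, case_sensitive=False)
```

With `case_sensitive=False`, `the` and `The` count as the same token type and therefore receive the same control label.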

neurox.data.loader#

Loading functions for activations, input tokens/sentences and labels

This module contains functions to load activations as well as source files with tokens and labels. Functions that support tokenized data are also provided.

neurox.data.loader.load_activations(activations_path, num_neurons_per_layer=None, is_brnn=False, dtype=None)[source]#

Load extracted activations.

Parameters:

activations_path (str) – Path to the activations file. Can be of type t7, pt, acts, json or hdf5
num_neurons_per_layer (int, optional) – Number of neurons per layer, used to compute the total number of layers. This is only necessary in the case of t7/pt/acts activations.
is_brnn (bool, optional) – If the model used to extract activations was bidirectional (default: False)
dtype (str, optional) – Only implemented for hdf5 and json files. Default: None. Pass None to keep the dtype stored in the activations file (only relevant for hdf5), or ‘float16’ or ‘float32’ to enforce half-precision or full-precision floats.

Returns:

activations (list of numpy.ndarray) – List of sentence representations, where each sentence representation is a numpy matrix of shape [num tokens in sentence x concatenated representation size]
num_layers (int) – Number of layers. This is usually representation_size/num_neurons_per_layer. Divide again by 2 if model was bidirectional
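The layer count described above is simple arithmetic. The following sketch makes it explicit; `infer_num_layers` is a hypothetical helper written for this illustration and is not a NeuroX function.

```python
def infer_num_layers(representation_size, num_neurons_per_layer, is_brnn=False):
    """Derive the number of layers from the concatenated representation size."""
    num_layers, remainder = divmod(representation_size, num_neurons_per_layer)
    if remainder != 0:
        raise ValueError("representation size is not a multiple of the layer size")
    if is_brnn:
        # Each layer contributes a forward and a backward block.
        if num_layers % 2 != 0:
            raise ValueError("bidirectional models need an even number of blocks")
        num_layers //= 2
    return num_layers


unidirectional_layers = infer_num_layers(9984, 768)
bidirectional_layers = infer_num_layers(4000, 500, is_brnn=True)
```

For example, a concatenated representation of 9984 values with 768 neurons per layer corresponds to 13 layers, and 4000 values with 500 neurons per layer in a bidirectional model correspond to 4 layers.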

neurox.data.loader.filter_activations_by_layers(train_activations, test_activations, filter_layers, rnn_size, num_layers, is_brnn)[source]#

Filter activations so that they only contain specific layers.

Useful for performing layer-wise analysis.

Warning

This function is deprecated and will be removed in future versions.

Parameters:

train_activations (list of numpy.ndarray) – List of sentence representations from the train set, where each sentence representation is a numpy matrix of shape [NUM_TOKENS x NUM_NEURONS]. The method assumes that neurons from all layers are present, with the number of neurons in every layer given by rnn_size
test_activations (list of numpy.ndarray) – Similar to train_activations but with sentences from a test set.
filter_layers (str) – A comma-separated string of the form “f1,f2,f10”. “f” indicates a “forward” layer while “b” indicates a backward layer in a bidirectional RNN. If the activations are from a different kind of model, set is_brnn to False and provide only “f” entries. The number next to “f” is the layer number, 1-indexed, so “f1” corresponds to the embedding layer and so on.
rnn_size (int) – Number of neurons in every layer.
num_layers (int) – Total number of layers in the original model.
is_brnn (bool) – Boolean indicating if the neuron activations are from a bidirectional model.

Returns:

filtered_train_activations (list of numpy.ndarray) – Filtered train activations
filtered_test_activations (list of numpy.ndarray) – Filtered test activations

Notes

For bidirectional models, the method assumes that the internal structure is as follows: forward layer 1 neurons, backward layer 1 neurons, forward layer 2 neurons …
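To make the layout concrete, the sketch below maps a filter_layers string to the column indices it would select, assuming the block order stated in the note (forward layer 1, backward layer 1, forward layer 2, and so on). The helper `layer_columns` is hypothetical and is not the library's implementation.

```python
def layer_columns(filter_layers, rnn_size, num_layers, is_brnn):
    """Translate e.g. 'f1,b2' into the neuron columns to keep (hypothetical)."""
    columns = []
    for entry in filter_layers.split(","):
        entry = entry.strip()
        direction, layer = entry[0], int(entry[1:])
        if direction not in ("f", "b"):
            raise ValueError(f"unknown direction in {entry!r}")
        if direction == "b" and not is_brnn:
            raise ValueError("'b' entries require is_brnn=True")
        if not 1 <= layer <= num_layers:
            raise ValueError(f"layer out of range in {entry!r}")
        if is_brnn:
            # Blocks alternate: f1, b1, f2, b2, ...
            block = 2 * (layer - 1) + (1 if direction == "b" else 0)
        else:
            block = layer - 1
        start = block * rnn_size
        columns.extend(range(start, start + rnn_size))
    return columns


brnn_columns = layer_columns("f1,b2", rnn_size=2, num_layers=2, is_brnn=True)
plain_columns = layer_columns("f2", rnn_size=3, num_layers=2, is_brnn=False)

# Applying the selection to one token's activation row.
row = [10, 11, 20, 21, 30, 31, 40, 41]
filtered_row = [row[c] for c in brnn_columns]
```

With two neurons per layer, "f1,b2" keeps the first block and the fourth block of each token's row.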

neurox.data.loader.load_aux_data(source_path, labels_path, source_aux_path, activations, max_sent_l, ignore_start_token=False)[source]#

Load word-annotated text-label pairs data represented as sentences, where activation extraction was performed on tokenized text. This function loads the source text, source tokenized text, target labels, and activations and tries to make them perfectly parallel, i.e. the number of tokens in line N of source matches the number of tokens in line N of target, and the number of tokens in source_aux matches the number of activations at index N. The method will delete non-matching activation/source/source_aux/target tuples, up to a maximum of 100 before failing. The method will also ignore sentences longer than the provided maximum. The activations will be modified in place.

Warning

This function is deprecated and will be removed in future versions.

Parameters:

source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
source_aux_path (str) – Path to the source text file with tokenization, one sentence per line
activations (list of numpy.ndarray) – Activations returned from loader.load_activations
max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.
ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.

Returns:

tokens – Dictionary containing three lists, source, source_aux and target. source contains all of the sentences from source_path that were not ignored. source_aux contains all tokenized sentences from source_aux_path. target contains the parallel set of annotated labels.

Return type:

dict

neurox.data.loader.load_data(source_path, labels_path, activations, max_sent_l, ignore_start_token=False, sentence_classification=False)[source]#

Load word-annotated text-label pairs data represented as sentences. This function loads the source text, target labels, and activations and tries to make them perfectly parallel, i.e. number of tokens in line N of source would match the number of tokens in line N of target, and also match the number of activations at index N. The method will delete non-matching activation/source/target pairs, up to a maximum of 100 before failing. The method will also ignore sentences longer than the provided maximum. The activations will be modified in place.

Parameters:

source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations
max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.
ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.
sentence_classification (bool, optional) – Flag to indicate if this is a sentence classification task, where every sentence actually has only a single activation (e.g. [CLS] token’s activations in the case of BERT)

Returns:

tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.

Return type:

dict
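The alignment behaviour described for load_data can be sketched in plain Python. This is a simplified illustration, not the library's code: `align_parallel` and `max_mismatches` are hypothetical names, the sketch returns a new activations list instead of modifying the input in place, and treating overlong sentences as ignored rather than as mismatches is an assumption.

```python
def align_parallel(source_lines, label_lines, activations, max_sent_l, max_mismatches=100):
    """Keep only sentences whose tokens, labels and activations line up."""
    tokens = {"source": [], "target": []}
    kept_activations = []
    mismatches = 0
    for source_line, label_line, sentence_acts in zip(
        source_lines, label_lines, activations
    ):
        source_tokens = source_line.split()
        label_tokens = label_line.split()
        if len(source_tokens) > max_sent_l:
            continue  # assumption: too long means ignored, not a mismatch
        if not (len(source_tokens) == len(label_tokens) == len(sentence_acts)):
            mismatches += 1
            if mismatches > max_mismatches:
                raise ValueError("too many non-matching sentences")
            continue
        tokens["source"].append(source_tokens)
        tokens["target"].append(label_tokens)
        kept_activations.append(sentence_acts)
    return tokens, kept_activations


source_lines = ["a b c", "d e", "f g h i j"]
label_lines = ["X Y Z", "X", "A B C D E"]
activations = [[[0.1]] * 3, [[0.2]] * 2, [[0.3]] * 5]  # one vector per token
tokens, kept = align_parallel(source_lines, label_lines, activations, max_sent_l=4)
```

In this toy input the second sentence is dropped because it has two tokens but one label, and the third is dropped because it exceeds max_sent_l.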

neurox.data.loader.load_sentence_data(source_path, labels_path, activations)[source]#

Loads sentence-annotated text-label pairs. This function loads the source text, target labels, and activations and tries to make them perfectly parallel, i.e. number of tokens in line N of source would match the number of activations at index N. The method will delete non-matching activation/source pairs. The activations will be modified in place.

Parameters:

source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations

Returns:

Return type:

dict

neurox.data.representations#

Utility functions to manage representations.

This module contains functions that will help in managing extracted representations, specifically on sub-word based data.

neurox.data.representations.bpe_get_avg_activations(tokens, activations)[source]#

Aggregates activations by averaging assuming BPE-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with “@@”. The activations are aggregated by averaging over subwords.

Warning

This function is deprecated and will be removed in future versions.

Parameters:

tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns:

activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.

Return type:

list of numpy.ndarray
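The averaging over subwords can be sketched for a single sentence as follows. The helper `average_bpe_activations` is hypothetical and works on plain lists of floats rather than numpy arrays; it only illustrates the “@@” convention described above.

```python
def average_bpe_activations(subwords, activations):
    """Average subword vectors into one vector per word (hypothetical sketch).

    subwords: list of str, non-terminal pieces end with '@@'
    activations: list of equal-length float vectors, one per subword
    """
    word_vectors = []
    pending = []  # vectors of the word currently being assembled
    for subword, vector in zip(subwords, activations):
        pending.append(vector)
        if not subword.endswith("@@"):
            count = len(pending)
            word_vectors.append([sum(values) / count for values in zip(*pending)])
            pending = []
    if pending:
        raise ValueError("sentence ends with a non-terminal subword")
    return word_vectors


subwords = ["un@@", "believ@@", "able", "story"]
activations = [[1.0, 0.0], [2.0, 3.0], [3.0, 6.0], [5.0, 5.0]]
word_vectors = average_bpe_activations(subwords, activations)
```

The three pieces of the first word collapse into a single averaged vector, while the single-piece word keeps its vector unchanged.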

neurox.data.representations.bpe_get_last_activations(tokens, activations, is_brnn=True)[source]#

Aggregates activations by picking the last subword assuming BPE-based tokenization.

Warning

This function is deprecated and will be removed in future versions.

Parameters:

tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies for RNN models.

Returns:

activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.

Return type:

list of numpy.ndarray

neurox.data.representations.char_get_avg_activations(tokens, activations)[source]#

Aggregates activations by averaging assuming Character-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by averaging over characters.

Warning

This function is deprecated and will be removed in future versions.

Parameters:

tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns:

activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.

Return type:

list of numpy.ndarray

neurox.data.representations.char_get_last_activations(tokens, activations, is_brnn=True)[source]#

Aggregates activations by picking the last character assuming character-based tokenization.

Given loaded tokens data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by picking the last character for any given word.

Warning

This function is deprecated and will be removed in future versions.

Parameters:

tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies for RNN models.

Returns:

activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.

Return type:

list of numpy.ndarray

neurox.data.representations.sent_get_last_activations(tokens, activations)[source]#

Gets the summary vector for the input sentences.

Given loaded tokens data and activations, this function picks the final token’s activations for every sentence, essentially giving summary vectors for every sentence in the dataset. This is mostly applicable for RNNs.

Note

Bidirectionality is currently not handled in the case of BiRNNs.

Parameters:

tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.

Returns:

activations – Summary activations corresponding to one per actual sentence in the original text.

Return type:

list of numpy.ndarray
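Picking the final token's activations per sentence can be sketched as below. `last_token_vectors` is a hypothetical helper operating on nested lists; the exact shape of the arrays the library returns is not stated here, so returning one flat vector per sentence is an assumption.

```python
def last_token_vectors(activations):
    """Pick the final token's vector from each sentence matrix."""
    summaries = []
    for sentence in activations:
        if len(sentence) == 0:
            raise ValueError("cannot summarize an empty sentence")
        summaries.append(sentence[-1])
    return summaries


activations = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # sentence with 3 tokens
    [[1.0, 2.0]],                          # sentence with 1 token
]
summaries = last_token_vectors(activations)
```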

neurox.data.utils#

neurox.data.utils.save_files(words, labels, activations, output_prefix, output_type='hdf5', decompose_layers=False, filter_layers=None)[source]#

Save word and label files in the text format and activations in the specified format (default hdf5 format)

Parameters:

words (list) – A list of words
labels (list) – A list of labels for every word
activations (list) – A list of word-wise activations
output_prefix (string) – Specify prefix of the output files

Return type:

Save word, label and activation files

neurox.data.writer#

Representations Writers

Module with various writers for saving representations/activations. Currently, two file types are supported:

hdf5: This is a binary format, and results in smaller overall files. The structure of the file is as follows:
- sentence_to_idx dataset: Contains a single json string at index 0 that maps sentences to indices
- Indices 0 through N-1 datasets: Each index corresponds to one sentence. The value of the dataset is a tensor with dimensions num_layers x sentence_length x embedding_size, where embedding_size may include multiple layers
json: This is a human-readable format. There is some loss of precision, since each activation value is saved using 8 decimal places. Concretely, this results in a jsonl file, where each line is a json string corresponding to a single sentence. The structure of each line is as follows:
- linex_idx: Sentence index
- features: List of tokens (with their activations)
  - token: The current token
  - layers: List of layers
    - index: Layer index (does not correspond to original model’s layers)
    - values: List of activation values for all neurons in the layer
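The json line structure listed above can be illustrated with a small sketch. The key names are copied from the description above and should be checked against a real output file; `to_json_line` is a hypothetical helper, not the library's writer. The input is assumed to follow the num_layers x sentence_length x embedding_size layout.

```python
import json


def to_json_line(sentence_idx, tokens, activations):
    """Build one jsonl record; activations[layer][token] is a list of floats."""
    features = []
    for token_idx, token in enumerate(tokens):
        layers = [
            {
                "index": layer_idx,
                # Values are rounded to 8 decimal places, hence the precision loss.
                "values": [round(float(v), 8) for v in layer[token_idx]],
            }
            for layer_idx, layer in enumerate(activations)
        ]
        features.append({"token": token, "layers": layers})
    return json.dumps({"linex_idx": sentence_idx, "features": features})


line = to_json_line(0, ["Hello", "world"], [[[0.123456789, 1.0], [2.0, 3.0]]])
record = json.loads(line)
```

Reading the line back gives a dictionary with one entry per token, each holding one entry per layer.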

The writers also support saving activations from specific layers only, using the filter_layers argument. Since activation files can be large, an additional option for decomposing the representations into layer-wise files is also provided.

class neurox.data.writer.ActivationsWriter(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#

Bases: object

Class that encapsulates all available writers.

This is the only class that should be used by the rest of the library.

filename#

Filename for storing the activations. May not be used exactly if decompose_layers is True.

Type: str

filetype#

An additional hint for the filetype. This argument is optional; the file type will be detected automatically from the filename if none is supplied.

Type: str

decompose_layers#

Set to true if each layer’s activations should be saved in a separate file.

Type: bool

filter_layers#

Comma separated list of layer indices to save.

Type: str

__init__(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#

open()[source]#: Method to open the underlying files. Will be called automatically by the class instance when necessary.

write_activations(sentence_idx, extracted_words, activations)[source]#: Method to write a single sentence’s activations to file

close()[source]#: Method to close the underlying files.

static get_writer(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#: Method to get the correct writer based on filename and filetype

static add_writer_options(parser)[source]#: Method to return argparse arguments specific to activation writers

class neurox.data.writer.ActivationsWriterManager(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#

Bases: ActivationsWriter

Manager class that handles decomposition and filtering.

Decomposition requires multiple writers (one per file) and filtering requires processing the activations to remove unneeded layer activations. This class sits on top of the actual activations writer to manage these operations.

__init__(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#

open(num_layers)[source]#: Method to open the underlying files. Will be called automatically by the class instance when necessary.

write_activations(sentence_idx, extracted_words, activations)[source]#: Method to write a single sentence’s activations to file

close()[source]#: Method to close the underlying files.

class neurox.data.writer.HDF5ActivationsWriter(filename, dtype='float32')[source]#

Bases: ActivationsWriter

__init__(filename, dtype='float32')[source]#

open()[source]#: Method to open the underlying files. Will be called automatically by the class instance when necessary.

write_activations(sentence_idx, extracted_words, activations)[source]#: Method to write a single sentence’s activations to file

close()[source]#: Method to close the underlying files.

class neurox.data.writer.JSONActivationsWriter(filename, dtype='float32')[source]#

Bases: ActivationsWriter

__init__(filename, dtype='float32')[source]#

open()[source]#: Method to open the underlying files. Will be called automatically by the class instance when necessary.

write_activations(sentence_idx, extracted_words, activations)[source]#: Method to write a single sentence’s activations to file

close()[source]#: Method to close the underlying files.
