neurox.data#
Submodules:
neurox.data.annotate#
Given a list of sentences, their activations and a pattern, create a binary labeled dataset based on the pattern, where the pattern can be a regular expression, a list of words or a function. For example, one can create a binary dataset of years vs. not-years (2004 vs. this) by specifying a regular expression that matches years. The program extracts positive class examples based on the provided filter and treats the rest of the examples as negative class examples. The output of the program is a word file, a label file and an activation file.
- neurox.data.annotate.annotate_data(source_path, activations_path, binary_filter, output_prefix, output_type='hdf5', decompose_layers=False, filter_layers=None)[source]#
Given a set of sentences, per-word activations, a binary_filter and an output_prefix, creates binary data and saves it to disk. A binary filter can be a set of words, a regex object or a function.
- Parameters:
source_path (text file with one sentence per line) –
activations (list) – A list of sentence-wise activations
binary_filter (a set of words or a regex object or a function) –
output_prefix (prefix of the output files that will be saved as the output of this script) –
- Return type:
Saves a word file, a binary label file and their activations
Example
annotate_data(source_path, activations_path, re.compile(r'^\w\w$'), output_prefix) selects words of exactly two characters as the positive class.
annotate_data(source_path, activations_path, {'is', 'can'}, output_prefix) selects occurrences of 'is' and 'can' as the positive class.
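The three kinds of binary filter described above can be illustrated without NeuroX itself. The following is a minimal sketch, not the library's implementation: `is_positive` and `label_words` are hypothetical helper names, the "positive"/"negative" label strings are placeholders, and applying a regex with `match` is an assumption.

```python
import re


def is_positive(word, binary_filter):
    """Return True if `word` falls in the positive class (hypothetical helper)."""
    if isinstance(binary_filter, set):
        # A set of words: membership decides the class.
        return word in binary_filter
    if isinstance(binary_filter, re.Pattern):
        # Assumption: the caller anchors the pattern, so match() is enough.
        return binary_filter.match(word) is not None
    if callable(binary_filter):
        # A function: any truthy return value marks the word as positive.
        return bool(binary_filter(word))
    raise TypeError("binary_filter must be a set, a compiled regex or a function")


def label_words(sentence, binary_filter):
    """Pair every whitespace-separated word with a placeholder class label."""
    return [
        (word, "positive" if is_positive(word, binary_filter) else "negative")
        for word in sentence.split()
    ]


# Years vs. not-years, as in the module description.
year_filter = re.compile(r"^\d{4}$")
labels = label_words("It happened in 2004 not this year", year_filter)
```

Here `labels` pairs "2004" with the positive class and every other word with the negative class.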
neurox.data.control_task#
- neurox.data.control_task.create_sequence_labeling_dataset(train_tokens, dev_source=None, test_source=None, case_sensitive=True, sample_from='same')[source]#
Method that prepares labels for a control task, as defined in §2.1 of Hewitt and Liang (2019) <https://aclanthology.org/D19-1275.pdf>
Target classes are selected randomly for each token type in the datasets. The number of control task classes is the same as the number of classes in train_tokens['target']. The distribution of control task labels can be specified.
- Parameters:
train_tokens (dict) – Dictionary containing two lists of lists representing the training set, source and target. As produced by dataloader.
dev_source (list, optional) – List containing the source tokens from the development set, as produced by dev_tokens['source']
test_source (list, optional) – List containing the source tokens from the test set, as produced by test_tokens['source']
case_sensitive (bool, optional) – Defaults to True. Sets whether the token comparison (for assigning the control task labels) is case-sensitive or case-insensitive.
sample_from (str, optional) – defaults to ‘same’. The distribution from which control task labels are sampled. ‘same’: Labels are sampled from the same distribution as the main task labels. ‘uniform’: Labels are sampled from a uniform distribution.
- Returns:
control_task_tokens – A list with one, two or three elements, depending on whether control task labels should be created for only the train set, or also for the dev and test sets. Each element of the list is a dictionary containing two lists, source and target. The source list is the same as in the tokens input. The target list is the list of control task labels.
- Return type:
list
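The idea of assigning one random label per token type can be sketched in plain Python. This is an illustration under stated assumptions, not the NeuroX implementation: `make_control_labels` and its `seed` argument are hypothetical, and reusing the main task's class names as control labels is a simplification.

```python
import random
from collections import Counter


def make_control_labels(train_tokens, sample_from="same", case_sensitive=True, seed=0):
    """Assign each token type a fixed, randomly drawn label (hypothetical helper)."""
    rng = random.Random(seed)
    # Count main task labels to recover the class set and its distribution.
    label_counts = Counter(label for sent in train_tokens["target"] for label in sent)
    classes = sorted(label_counts)
    if sample_from == "same":
        weights = [label_counts[c] for c in classes]
    elif sample_from == "uniform":
        weights = [1] * len(classes)
    else:
        raise ValueError("sample_from must be 'same' or 'uniform'")

    type_to_label = {}
    control_targets = []
    for sentence in train_tokens["source"]:
        sentence_labels = []
        for token in sentence:
            key = token if case_sensitive else token.lower()
            # Draw a label the first time a token type is seen, then reuse it.
            if key not in type_to_label:
                type_to_label[key] = rng.choices(classes, weights=weights, k=1)[0]
            sentence_labels.append(type_to_label[key])
        control_targets.append(sentence_labels)
    return {"source": train_tokens["source"], "target": control_targets}


train_tokens = {
    "source": [["The", "cat", "sat"], ["the", "cat", "ran"]],
    "target": [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"]],
}
control = make_control_labels(train_tokens, case_sensitive=False)
```

Because the label depends only on the token type, "The" and "the" receive the same control label when `case_sensitive` is False, and "cat" receives the same label in both sentences.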
neurox.data.loader#
Loading functions for activations, input tokens/sentences and labels
This module contains functions to load activations as well as source files with tokens and labels. Functions that support tokenized data are also provided.
- neurox.data.loader.load_activations(activations_path, num_neurons_per_layer=None, is_brnn=False, dtype=None)[source]#
Load extracted activations.
- Parameters:
activations_path (str) – Path to the activations file. Can be of type t7, pt, acts, json or hdf5
num_neurons_per_layer (int, optional) – Number of neurons per layer, used to compute the total number of layers. This is only necessary in the case of t7/pt/acts activations.
is_brnn (bool, optional) – If the model used to extract activations was bidirectional (default: False)
dtype (str, optional) – Only implemented for hdf5 and json files. Default: None. Pass None to keep the dtype stored in the activations file (only relevant for hdf5), or ‘float16’ or ‘float32’ to enforce half-precision or full-precision floats.
- Returns:
activations (list of numpy.ndarray) – List of sentence representations, where each sentence representation is a numpy matrix of shape [num tokens in sentence x concatenated representation size]
num_layers (int) – Number of layers. This is usually representation_size/num_neurons_per_layer. Divide again by 2 if the model was bidirectional.
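The arithmetic behind num_layers can be written out as a short sketch. `infer_num_layers` is a hypothetical helper used only for illustration, and the divisibility check is an added assumption; the sizes in the example are made up.

```python
def infer_num_layers(representation_size, num_neurons_per_layer, is_brnn=False):
    """Derive the layer count from the concatenated representation size."""
    if representation_size % num_neurons_per_layer != 0:
        raise ValueError("representation size is not a multiple of the layer size")
    num_layers = representation_size // num_neurons_per_layer
    if is_brnn:
        # Forward and backward states are both stored, so halve the count.
        num_layers //= 2
    return num_layers


# 13 layers of 768 neurons concatenated into one 9984-wide representation.
unidirectional_layers = infer_num_layers(9984, 768)
# 4 bidirectional layers of 500 neurons per direction: 4 * 500 * 2 = 4000.
bidirectional_layers = infer_num_layers(4000, 500, is_brnn=True)
```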
- neurox.data.loader.filter_activations_by_layers(train_activations, test_activations, filter_layers, rnn_size, num_layers, is_brnn)[source]#
Filter activations so that they only contain specific layers.
Useful for performing layer-wise analysis.
Warning
This function is deprecated and will be removed in future versions.
- Parameters:
train_activations (list of numpy.ndarray) – List of sentence representations from the train set, where each sentence representation is a numpy matrix of shape [NUM_TOKENS x NUM_NEURONS]. The method assumes that neurons from all layers are present, with the number of neurons in every layer given by rnn_size
test_activations (list of numpy.ndarray) – Similar to train_activations but with sentences from a test set.
filter_layers (str) – A comma-separated string of the form “f1,f2,f10”. “f” indicates a “forward” layer while “b” indicates a backward layer in a bidirectional RNN. If the activations are from a different kind of model, set is_brnn to False and provide only “f” entries. The number next to “f” is the layer number, 1-indexed. So “f1” corresponds to the embedding layer and so on.
rnn_size (int) – Number of neurons in every layer.
num_layers (int) – Total number of layers in the original model.
is_brnn (bool) – Boolean indicating if the neuron activations are from a bidirectional model.
- Returns:
filtered_train_activations (list of numpy.ndarray) – Filtered train activations
filtered_test_activations (list of numpy.ndarray) – Filtered test activations
Notes
For bidirectional models, the method assumes that the internal structure is as follows: forward layer 1 neurons, backward layer 1 neurons, forward layer 2 neurons …
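Given the layout described in the note above, the neuron columns selected by a filter_layers string can be computed as follows. This is a sketch under that stated layout, not the library's code; `layer_columns` is a hypothetical helper.

```python
def layer_columns(filter_layers, rnn_size, num_layers, is_brnn):
    """Map a string such as "f1,b2" to the neuron column indices it selects."""
    columns = []
    for entry in filter_layers.split(","):
        direction, layer = entry[0], int(entry[1:])
        if direction not in ("f", "b"):
            raise ValueError(f"unknown direction in {entry!r}")
        if direction == "b" and not is_brnn:
            raise ValueError("backward layers need a bidirectional model")
        if not 1 <= layer <= num_layers:
            raise ValueError(f"layer {layer} is outside 1..{num_layers}")
        if is_brnn:
            # Blocks alternate: forward 1, backward 1, forward 2, backward 2, ...
            block = 2 * (layer - 1) + (1 if direction == "b" else 0)
        else:
            block = layer - 1
        start = block * rnn_size
        columns.extend(range(start, start + rnn_size))
    return columns


brnn_columns = layer_columns("f1,b2", rnn_size=2, num_layers=2, is_brnn=True)
plain_columns = layer_columns("f2", rnn_size=3, num_layers=3, is_brnn=False)
```

With a NumPy sentence matrix of shape [NUM_TOKENS x NUM_NEURONS], indexing with `sentence[:, columns]` would then keep only the selected layers.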
- neurox.data.loader.load_aux_data(source_path, labels_path, source_aux_path, activations, max_sent_l, ignore_start_token=False)[source]#
Load word-annotated text-label pairs data represented as sentences, where activation extraction was performed on tokenized text. This function loads the source text, source tokenized text, target labels, and activations and tries to make them perfectly parallel, i.e. the number of tokens in line N of source would match the number of tokens in line N of target, and the number of tokens in source_aux will match the number of activations at index N. The method will delete non-matching activation/source/source_aux/target pairs, up to a maximum of 100 before failing. The method will also ignore sentences longer than the provided maximum. The activations will be modified in place.
Warning
This function is deprecated and will be removed in future versions.
- Parameters:
source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
source_aux_path (str) – Path to the source text file with tokenization, one sentence per line
activations (list of numpy.ndarray) – Activations returned from loader.load_activations
max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.
ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.
- Returns:
tokens – Dictionary containing three lists, source, source_aux and target. source contains all of the sentences from source_path that were not ignored. source_aux contains all tokenized sentences from source_aux_path. target contains the parallel set of annotated labels.
- Return type:
dict
- neurox.data.loader.load_data(source_path, labels_path, activations, max_sent_l, ignore_start_token=False, sentence_classification=False)[source]#
Load word-annotated text-label pairs data represented as sentences. This function loads the source text, target labels, and activations and tries to make them perfectly parallel, i.e. number of tokens in line N of source would match the number of tokens in line N of target, and also match the number of activations at index N. The method will delete non-matching activation/source/target pairs, up to a maximum of 100 before failing. The method will also ignore sentences longer than the provided maximum. The activations will be modified in place.
- Parameters:
source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations
max_sent_l (int) – Maximum length of sentences. Sentences containing more tokens will be ignored.
ignore_start_token (bool, optional) – Ignore the first token. Useful if there are line position markers in the source text.
sentence_classification (bool, optional) – Flag to indicate if this is a sentence classification task, where every sentence actually has only a single activation (e.g. [CLS] token’s activations in the case of BERT)
- Returns:
tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.
- Return type:
dict
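The alignment behaviour described for load_data can be sketched on in-memory lists. This is an illustration only, not the library's implementation: `align_sentences` is a hypothetical helper, it takes lines rather than file paths, and treating the limit of 100 as "fail on the 101st mismatch" is an assumption.

```python
def align_sentences(source_lines, label_lines, activations, max_sent_l, max_mismatches=100):
    """Keep only sentences whose tokens, labels and activations line up."""
    tokens = {"source": [], "target": []}
    kept_activations = []
    mismatches = 0
    for source, labels, sentence_acts in zip(source_lines, label_lines, activations):
        source_tokens, label_tokens = source.split(), labels.split()
        if len(source_tokens) > max_sent_l:
            # Over-length sentences are ignored and do not count as mismatches.
            continue
        if not (len(source_tokens) == len(label_tokens) == len(sentence_acts)):
            mismatches += 1
            if mismatches > max_mismatches:
                raise ValueError("too many non-matching sentences")
            continue
        tokens["source"].append(source_tokens)
        tokens["target"].append(label_tokens)
        kept_activations.append(sentence_acts)
    # Slice assignment mutates the caller's list, mirroring "modified in place".
    activations[:] = kept_activations
    return tokens


source = ["a b c", "d e", "f g h i"]
labels = ["X Y Z", "X", "X Y Z W"]
acts = [[[0.1], [0.2], [0.3]], [[0.4], [0.5]], [[0.6], [0.7], [0.8], [0.9]]]
tokens = align_sentences(source, labels, acts, max_sent_l=3)
```

In this example the second sentence is dropped because it has two tokens but one label, and the third is ignored because it exceeds `max_sent_l`, so only the first sentence and its activations remain.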
- neurox.data.loader.load_sentence_data(source_path, labels_path, activations)[source]#
Loads sentence-annotated text-label pairs. This function loads the source text, target labels, and activations and tries to make them perfectly parallel, i.e. number of tokens in line N of source would match the number of activations at index N. The method will delete non-matching activation/source pairs. The activations will be modified in place.
- Parameters:
source_path (str) – Path to the source text file, one sentence per line
labels_path (str) – Path to the annotated labels file, one sentence per line corresponding to the sentences in the source_path file.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations
- Returns:
tokens – Dictionary containing two lists, source and target. source contains all of the sentences from source_path that were not ignored. target contains the parallel set of annotated labels.
- Return type:
dict
neurox.data.representations#
Utility functions to manage representations.
This module contains functions that will help in managing extracted representations, specifically on sub-word based data.
- neurox.data.representations.bpe_get_avg_activations(tokens, activations)[source]#
Aggregates activations by averaging assuming BPE-based tokenization.
Given loaded tokens data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with “@@”. The activations are aggregated by averaging over subwords.
Warning
This function is deprecated and will be removed in future versions.
- Parameters:
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
- Returns:
activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.
- Return type:
list of numpy.ndarray
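The averaging over “@@”-marked subwords can be sketched with plain lists. This is not the deprecated function itself: `bpe_average` is a hypothetical helper that works on one sentence, and it uses nested lists where the library uses numpy arrays.

```python
def bpe_average(subwords, activations):
    """Average subword vectors into one vector per word.

    Every non-final subword of a word is assumed to end with "@@".
    """
    word_vectors = []
    current = []
    for subword, vector in zip(subwords, activations):
        current.append(vector)
        if not subword.endswith("@@"):
            # The word is complete: average its subword vectors column-wise.
            word_vectors.append([sum(column) / len(current) for column in zip(*current)])
            current = []
    return word_vectors


subwords = ["un@@", "believ@@", "able", "story"]
activations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
word_vectors = bpe_average(subwords, activations)
```

The three subwords of the first word collapse into a single averaged vector, while "story" keeps its own vector. Appending `current[-1]` instead of the average would give a last-subword variant of the same loop.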
- neurox.data.representations.bpe_get_last_activations(tokens, activations, is_brnn=True)[source]#
Aggregates activations by picking the last subword assuming BPE-based tokenization.
Given loaded tokens data and activations, this function aggregates activations based on tokenized text. BPE-based tokenization is assumed, with every non-terminal subword ending with “@@”. The activations are aggregated by picking the last subword for any given word.
Warning
This function is deprecated and will be removed in future versions.
- Parameters:
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies for RNN models.
- Returns:
activations – Subword aggregated activations corresponding to one per actual token found in the untokenized text.
- Return type:
list of numpy.ndarray
- neurox.data.representations.char_get_avg_activations(tokens, activations)[source]#
Aggregates activations by averaging assuming Character-based tokenization.
Given loaded tokens data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by averaging over characters.
Warning
This function is deprecated and will be removed in future versions.
- Parameters:
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
- Returns:
activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.
- Return type:
list of numpy.ndarray
- neurox.data.representations.char_get_last_activations(tokens, activations, is_brnn=True)[source]#
Aggregates activations by picking the last character assuming Character-based tokenization.
Given loaded tokens data and activations, this function aggregates activations based on character-tokenized text. The activations are aggregated by picking the last character for any given word.
Warning
This function is deprecated and will be removed in future versions.
- Parameters:
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
is_brnn (bool, optional) – Whether the model from which activations were extracted was bidirectional. Only applies for RNN models.
- Returns:
activations – Character aggregated activations corresponding to one per actual token found in the untokenized text.
- Return type:
list of numpy.ndarray
- neurox.data.representations.sent_get_last_activations(tokens, activations)[source]#
Gets the summary vector for the input sentences.
Given loaded tokens data and activations, this function picks the final token’s activations for every sentence, essentially giving summary vectors for every sentence in the dataset. This is mostly applicable for RNNs.
Note
Bidirectionality is currently not handled in the case of BiRNNs.
- Parameters:
tokens (dict) – Dictionary containing three lists, source, source_aux and target. Usually the output of data.loader.load_aux_data.
activations (list of numpy.ndarray) – Activations returned from loader.load_activations.
- Returns:
activations – Summary activations corresponding to one per actual sentence in the original text.
- Return type:
list of numpy.ndarray
neurox.data.utils#
- neurox.data.utils.save_files(words, labels, activations, output_prefix, output_type='hdf5', decompose_layers=False, filter_layers=None)[source]#
Save word and label files in the text format and activations in the specified format (default hdf5 format)
- Parameters:
words (list) – A list of words
labels (list) – A list of labels for every word
activations (list) – A list of word-wise activations
output_prefix (string) – Specify prefix of the output files
- Return type:
Save word, label and activation files
neurox.data.writer#
Representations Writers
Module with various writers for saving representations/activations. Currently, two file types are supported:
- hdf5: This is a binary format, and results in smaller overall files. The structure of the file is as follows:
  - sentence_to_idx dataset: Contains a single json string at index 0 that maps sentences to indices
  - Indices 0 through N-1 datasets: Each index corresponds to one sentence. The value of the dataset is a tensor with dimensions num_layers x sentence_length x embedding_size, where embedding_size may include multiple layers
- json: This is a human-readable format. There is some loss of precision, since each activation value is saved using 8 decimal places. Concretely, this results in a jsonl file, where each line is a json string corresponding to a single sentence. The structure of each line is as follows:
  - linex_idx: Sentence index
  - features: List of tokens (with their activations)
    - token: The current token
    - layers: List of layers
      - index: Layer index (does not correspond to original model’s layers)
      - values: List of activation values for all neurons in the layer
The writers also support saving activations from specific layers only, using the
filter_layers argument. Since activation files can be large, an additional
option for decomposing the representations into layer-wise files is also
provided.
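The per-line structure of the json format described above can be sketched with the standard library. This is an illustration, not the writer's code: `build_json_line` is a hypothetical helper, the key names are taken from the description above, and using `round(value, 8)` to obtain 8 decimal places is an assumption.

```python
import json


def build_json_line(sentence_idx, tokens, activations):
    """Serialize one sentence; activations[layer][token] is a list of floats."""
    features = []
    for token_idx, token in enumerate(tokens):
        layers = [
            {
                "index": layer_idx,
                # Assumption: precision is limited by rounding to 8 decimals.
                "values": [round(value, 8) for value in layer[token_idx]],
            }
            for layer_idx, layer in enumerate(activations)
        ]
        features.append({"token": token, "layers": layers})
    return json.dumps({"linex_idx": sentence_idx, "features": features})


# One layer, two tokens, two neurons per token.
line = build_json_line(0, ["Hello", "world"], [[[0.123456789123, 1.0], [2.0, 3.0]]])
record = json.loads(line)
```

Each call produces one line of a jsonl file; reading the line back with `json.loads` recovers the nested features, layers and values.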
- class neurox.data.writer.ActivationsWriter(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#
Bases: object
Class that encapsulates all available writers.
This is the only class that should be used by the rest of the library.
- filename#
Filename for storing the activations. May not be used exactly if decompose_layers is True.
- Type:
str
- filetype#
An additional hint for the filetype. This argument is optional. The file type will be detected automatically from the filename if none is supplied.
- Type:
str
- decompose_layers#
Set to true if each layer’s activations should be saved in a separate file.
- Type:
bool
- filter_layers#
Comma separated list of layer indices to save.
- Type:
str
- __init__(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#
- open()[source]#
Method to open the underlying files. Will be called automatically by the class instance when necessary.
- write_activations(sentence_idx, extracted_words, activations)[source]#
Method to write a single sentence’s activations to file
- class neurox.data.writer.ActivationsWriterManager(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#
Bases: ActivationsWriter
Manager class that handles decomposition and filtering.
Decomposition requires multiple writers (one per file) and filtering requires processing the activations to remove unneeded layer activations. This class sits on top of the actual activations writer to manage these operations.
- __init__(filename, filetype=None, decompose_layers=False, filter_layers=None, dtype='float32')[source]#
- open(num_layers)[source]#
Method to open the underlying files. Will be called automatically by the class instance when necessary.
- class neurox.data.writer.HDF5ActivationsWriter(filename, dtype='float32')[source]#
Bases: ActivationsWriter
- open()[source]#
Method to open the underlying files. Will be called automatically by the class instance when necessary.
- class neurox.data.writer.JSONActivationsWriter(filename, dtype='float32')[source]#
Bases: ActivationsWriter
- open()[source]#
Method to open the underlying files. Will be called automatically by the class instance when necessary.