Class: HashingVectorizer
Convert a collection of text documents to a matrix of token occurrences.
It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected onto the Euclidean unit sphere if norm='l2'.
This text vectorizer implementation uses the hashing trick to map token strings to feature integer indices.
This strategy has several advantages:
- it is very low memory and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory;
- it is fast to pickle and unpickle, as it holds no state besides its constructor parameters;
- it can be used in a streaming (partial fit) or parallel pipeline, as no state is computed during fit.
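For orientation, here is a minimal end-to-end sketch of driving this class from TypeScript. It assumes the package's createPythonBridge helper as the bridge factory (as shown in the project README) and a local Python environment with scikit-learn installed; the n_features value is illustrative only.

```ts
import { HashingVectorizer, createPythonBridge } from 'sklearn'

// Assumption: createPythonBridge is the bridge factory from the project
// README; a local Python with scikit-learn installed is required.
const py = await createPythonBridge()

const vectorizer = new HashingVectorizer({ n_features: 1024 })
await vectorizer.init(py)

// Hash four tiny documents into a 4 x 1024 sparse document-term matrix.
const X = await vectorizer.fit_transform({
  X: ['one fish', 'two fish', 'red fish', 'blue fish'],
})

await vectorizer.dispose()
```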
Constructors
new HashingVectorizer()
new HashingVectorizer(opts?): HashingVectorizer
Parameters
Parameter | Type | Description |
---|---|---|
opts ? | object | - |
opts.alternate_sign ? | boolean | When true, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection. |
opts.analyzer ? | "word" | "char" | "char_wb" | Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input. |
opts.binary ? | boolean | If true , all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. |
opts.decode_error ? | "strict" | "ignore" | "replace" | Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’. |
opts.dtype ? | any | Type of the matrix returned by fit_transform() or transform(). |
opts.encoding ? | string | If bytes or files are given to analyze, this encoding is used to decode. |
opts.input ? | "filename" | "file" | "content" | If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. If 'file', the sequence items must have a read method (file-like object) that is called to fetch the bytes in memory. If 'content', the input is expected to be a sequence of items of type string or bytes. |
opts.lowercase ? | boolean | Convert all characters to lowercase before tokenizing. |
opts.n_features ? | number | The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners. |
opts.ngram_range ? | any | The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable. |
opts.norm ? | "l1" | "l2" | Norm used to normalize term vectors. undefined for no normalization. |
opts.preprocessor ? | any | Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable. |
opts.stop_words ? | any [] | "english" | If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative (see Using stop words). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. |
opts.strip_accents ? | "ascii" | "unicode" | Remove accents and perform other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any character. undefined (default) means no character normalization is performed. Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize . |
opts.token_pattern ? | string | Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted. |
opts.tokenizer ? | any | Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'. |
Returns HashingVectorizer
Defined in generated/feature_extraction/text/HashingVectorizer.ts:27
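As an illustration of the options above, here is one hypothetical configuration; the values are arbitrary choices for demonstration, not recommendations, and passing ngram_range as a two-element array is an assumption about how the bridge serializes Python tuples.

```ts
import { HashingVectorizer } from 'sklearn'

const vectorizer = new HashingVectorizer({
  n_features: 2 ** 18,   // more columns -> fewer hash collisions
  ngram_range: [1, 2],   // unigrams and bigrams
  norm: 'l2',            // project rows onto the Euclidean unit sphere
  alternate_sign: true,  // approximately conserve inner products
  lowercase: true,
  stop_words: 'english',
})
```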
Properties
Property | Type | Default value | Defined in |
---|---|---|---|
_isDisposed | boolean | false | generated/feature_extraction/text/HashingVectorizer.ts:25 |
_isInitialized | boolean | false | generated/feature_extraction/text/HashingVectorizer.ts:24 |
_py | PythonBridge | undefined | generated/feature_extraction/text/HashingVectorizer.ts:23 |
id | string | undefined | generated/feature_extraction/text/HashingVectorizer.ts:20 |
opts | any | undefined | generated/feature_extraction/text/HashingVectorizer.ts:21 |
Accessors
py
Get Signature
get py(): PythonBridge
Returns PythonBridge
Set Signature
set py(pythonBridge): void
Parameters
Parameter | Type |
---|---|
pythonBridge | PythonBridge |
Returns void
Defined in generated/feature_extraction/text/HashingVectorizer.ts:136
Methods
build_analyzer()
build_analyzer(opts): Promise<any>
Return a callable to process input data.
The callable handles preprocessing, tokenization, and n-grams generation.
Parameters
Parameter | Type |
---|---|
opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:209
build_preprocessor()
build_preprocessor(opts): Promise<any>
Return a function to preprocess the text before tokenization.
Parameters
Parameter | Type |
---|---|
opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:239
build_tokenizer()
build_tokenizer(opts): Promise<any>
Return a function that splits a string into a sequence of tokens.
Parameters
Parameter | Type |
---|---|
opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:269
decode()
decode(opts): Promise<any>
Decode the input into a string of unicode symbols.
The decoding strategy depends on the vectorizer parameters.
Parameters
Parameter | Type | Description |
---|---|---|
opts | object | - |
opts.doc ? | string | The string to decode. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:301
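A short usage sketch, assuming an initialized vectorizer as in the earlier examples:

```ts
// Returns the input decoded to a unicode string, honoring the `input`,
// `encoding`, and `decode_error` settings chosen at construction time.
const text = await vectorizer.decode({ doc: 'Caffè latte' })
```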
dispose()
dispose(): Promise<void>
Disposes of the underlying Python resources.
Once dispose() is called, the instance is no longer usable.
Returns Promise<void>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:190
fit()
fit(opts): Promise<any>
Only validates the estimator’s parameters.
This method exists to (i) validate the estimator’s parameters and (ii) remain consistent with the scikit-learn transformer API.
Parameters
Parameter | Type | Description |
---|---|---|
opts | object | - |
opts.X ? | any | Training data. |
opts.y ? | any | Not used, present for API consistency by convention. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:337
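Because the hashing trick is stateless, fit learns nothing. A sketch, assuming an initialized vectorizer: the call merely checks the constructor parameters, which keeps the class usable inside scikit-learn pipelines.

```ts
// No vocabulary is built; this only validates parameters.
await vectorizer.fit({ X: ['any', 'documents'] })
```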
fit_transform()
fit_transform(opts): Promise<any[]>
Transform a sequence of documents to a document-term matrix.
Parameters
Parameter | Type | Description |
---|---|---|
opts | object | - |
opts.X ? | any | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |
opts.y ? | any | Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline. |
Returns Promise<any[]>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:376
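A sketch of a typical call, assuming the default input: 'content' so that samples are plain strings (with input: 'filename' you would pass paths instead):

```ts
const X = await vectorizer.fit_transform({
  X: ['the quick brown fox', 'jumps over the lazy dog'],
})
// X is a 2 x n_features sparse document-term matrix.
```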
get_metadata_routing()
get_metadata_routing(opts): Promise<any>
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
Parameters
Parameter | Type | Description |
---|---|---|
opts | object | - |
opts.routing ? | any | A MetadataRequest encapsulating routing information. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:419
get_stop_words()
get_stop_words(opts): Promise<any>
Build or fetch the effective stop words list.
Parameters
Parameter | Type |
---|---|
opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:455
init()
init(py): Promise<void>
Initializes the underlying Python resources.
This instance is not usable until the Promise returned by init() resolves.
Parameters
Parameter | Type |
---|---|
py | PythonBridge |
Returns Promise<void>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:149
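A lifecycle sketch; createPythonBridge is assumed to be the bridge factory exported by the package (any connected PythonBridge instance works):

```ts
import { HashingVectorizer, createPythonBridge } from 'sklearn'

const py = await createPythonBridge()
const vectorizer = new HashingVectorizer()
await vectorizer.init(py)   // not usable before this resolves
// ... use the vectorizer ...
await vectorizer.dispose()  // frees the Python-side object
```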
partial_fit()
partial_fit(opts): Promise<any>
Only validates the estimator’s parameters.
This method exists to (i) validate the estimator’s parameters and (ii) remain consistent with the scikit-learn transformer API.
Parameters
Parameter | Type | Description |
---|---|---|
opts | object | - |
opts.X ? | any | Training data. |
opts.y ? | any | Not used, present for API consistency by convention. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:487
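Since no state is computed during fit, streaming use is straightforward: call partial_fit per batch (a no-op beyond parameter checks) and hash each batch with transform. A sketch, assuming an initialized vectorizer:

```ts
const batches = [
  ['doc a', 'doc b'],
  ['doc c', 'doc d'],
]
for (const batch of batches) {
  await vectorizer.partial_fit({ X: batch })   // parameter validation only
  const Xb = await vectorizer.transform({ X: batch })
  // ...feed Xb to an incremental learner, e.g. SGDClassifier.partial_fit...
}
```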
set_output()
set_output(opts): Promise<any>
Set output container.
See Introducing the set_output API for an example on how to use the API.
Parameters
Parameter | Type | Description |
---|---|---|
opts | object | - |
opts.transform ? | "default" | "pandas" | "polars" | Configure output of transform and fit_transform. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:528
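A sketch, assuming an initialized vectorizer. Note that this vectorizer emits sparse matrices, which the dataframe containers may not accept, so the default is usually the right choice here:

```ts
// Keep the default (sparse) output container.
await vectorizer.set_output({ transform: 'default' })
```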
transform()
transform(opts): Promise<any[]>
Transform a sequence of documents to a document-term matrix.
Parameters
Parameter | Type | Description |
---|---|---|
opts | object | - |
opts.X ? | any | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |
Returns Promise<any[]>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:562
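Because the token-to-column mapping is a fixed hash function, transform on new documents produces columns aligned with any earlier fit_transform output. A sketch, assuming an initialized vectorizer:

```ts
const Xnew = await vectorizer.transform({ X: ['a brand new document'] })
// Columns line up with matrices produced earlier by fit_transform().
```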