Class: HashingVectorizer
Convert a collection of text documents to a matrix of token occurrences.
It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected onto the Euclidean unit sphere if norm='l2'.
This text vectorizer implementation uses the hashing trick to map token strings to feature integer indices.
This strategy has several advantages:
- it is very low memory and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory;
- it is fast to pickle and un-pickle, as it holds no state besides the constructor parameters;
- it can be used in a streaming (partial fit) or parallel pipeline, as there is no state computed during fit.
Constructors
new HashingVectorizer()
new HashingVectorizer(opts?): HashingVectorizer
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts? | object | - |
| opts.alternate_sign? | boolean | When true, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection. |
| opts.analyzer? | "word" \| "char" \| "char_wb" | Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input. |
| opts.binary? | boolean | If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. |
| opts.decode_error? | "strict" \| "ignore" \| "replace" | Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'. |
| opts.dtype? | any | Type of the matrix returned by fit_transform() or transform(). |
| opts.encoding? | string | If bytes or files are given to analyze, this encoding is used to decode. |
| opts.input? | "filename" \| "file" \| "content" | If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. If 'file', the sequence items must have a read method (file-like object) that is called to fetch the bytes in memory. If 'content', the input is expected to be a sequence of items of type string or bytes. |
| opts.lowercase? | boolean | Convert all characters to lowercase before tokenizing. |
| opts.n_features? | number | The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners. |
| opts.ngram_range? | any | The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable. |
| opts.norm? | "l1" \| "l2" | Norm used to normalize term vectors. undefined for no normalization. |
| opts.preprocessor? | any | Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable. |
| opts.stop_words? | any[] \| "english" | If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see Using stop words). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. |
| opts.strip_accents? | "ascii" \| "unicode" | Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any character. undefined (default) means no character normalization is performed. Both 'ascii' and 'unicode' use NFKD normalization from unicodedata.normalize. |
| opts.token_pattern? | string | Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). If there is a capturing group in token_pattern, then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted. |
| opts.tokenizer? | any | Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'. |
Returns HashingVectorizer
Defined in generated/feature_extraction/text/HashingVectorizer.ts:27
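A minimal construction sketch using a few of the options documented above. The 'sklearn' import path is an assumption and may differ in your environment; no Python work happens until init() is called on the instance.

```ts
import { HashingVectorizer } from 'sklearn' // assumed import path

const vectorizer = new HashingVectorizer({
  n_features: 2 ** 18,   // fewer columns than scikit-learn's default of 2**20, at the cost of more collisions
  ngram_range: [1, 2],   // unigrams and bigrams
  norm: 'l2',            // project each row onto the Euclidean unit sphere
  alternate_sign: true,  // approximately conserve inner products in the hashed space
  lowercase: true,
})
```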
Properties
| Property | Type | Default value | Defined in |
|---|---|---|---|
| _isDisposed | boolean | false | generated/feature_extraction/text/HashingVectorizer.ts:25 |
| _isInitialized | boolean | false | generated/feature_extraction/text/HashingVectorizer.ts:24 |
| _py | PythonBridge | undefined | generated/feature_extraction/text/HashingVectorizer.ts:23 |
| id | string | undefined | generated/feature_extraction/text/HashingVectorizer.ts:20 |
| opts | any | undefined | generated/feature_extraction/text/HashingVectorizer.ts:21 |
Accessors
py
Get Signature
get py(): PythonBridge
Returns PythonBridge
Set Signature
set py(pythonBridge): void
Parameters
| Parameter | Type |
|---|---|
| pythonBridge | PythonBridge |
Returns void
Defined in generated/feature_extraction/text/HashingVectorizer.ts:136
Methods
build_analyzer()
build_analyzer(opts): Promise<any>
Return a callable to process input data.
The callable handles preprocessing, tokenization, and n-grams generation.
Parameters
| Parameter | Type |
|---|---|
| opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:209
build_preprocessor()
build_preprocessor(opts): Promise<any>
Return a function to preprocess the text before tokenization.
Parameters
| Parameter | Type |
|---|---|
| opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:239
build_tokenizer()
build_tokenizer(opts): Promise<any>
Return a function that splits a string into a sequence of tokens.
Parameters
| Parameter | Type |
|---|---|
| opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:269
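A hedged sketch of the three builders, assuming a vectorizer that has already been initialized with init() (see below). Each call resolves to a handle to the corresponding Python callable built by scikit-learn.

```ts
// Assumes `vectorizer` is an initialized HashingVectorizer instance.
const analyzer = await vectorizer.build_analyzer({})         // preprocessing + tokenization + n-gram generation
const preprocessor = await vectorizer.build_preprocessor({}) // string transformation only
const tokenizer = await vectorizer.build_tokenizer({})       // splits a string into tokens
```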
decode()
decode(opts): Promise<any>
Decode the input into a string of unicode symbols.
The decoding strategy depends on the vectorizer parameters.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.doc? | string | The string to decode. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:301
dispose()
dispose(): Promise<void>
Disposes of the underlying Python resources.
Once dispose() is called, the instance is no longer usable.
Returns Promise<void>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:190
fit()
fit(opts): Promise<any>
Only validates the estimator's parameters.
This method exists to (i) validate the estimator's parameters and (ii) remain consistent with the scikit-learn transformer API.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.X? | any | Training data. |
| opts.y? | any | Not used, present for API consistency by convention. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:337
fit_transform()
fit_transform(opts): Promise<any[]>
Transform a sequence of documents to a document-term matrix.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.X? | any | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |
| opts.y? | any | Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline. |
Returns Promise<any[]>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:376
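A short sketch of hashing a small corpus, assuming a vectorizer constructed and initialized as in the earlier examples. Because the vectorizer is stateless, fit_transform() does not learn a vocabulary; it validates parameters and hashes each document into a row with n_features columns.

```ts
const docs = [
  'This is the first document.',
  'This document is the second document.',
  'And this is the third one.',
]

// Resolves to a sparse document-term matrix with one row per document.
const X = await vectorizer.fit_transform({ X: docs })
```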
get_metadata_routing()
get_metadata_routing(opts): Promise<any>
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.routing? | any | A MetadataRequest encapsulating routing information. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:419
get_stop_words()
get_stop_words(opts): Promise<any>
Build or fetch the effective stop words list.
Parameters
| Parameter | Type |
|---|---|
| opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:455
init()
init(py): Promise<void>
Initializes the underlying Python resources.
This instance is not usable until the Promise returned by init() resolves.
Parameters
| Parameter | Type |
|---|---|
| py | PythonBridge |
Returns Promise<void>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:149
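A lifecycle sketch tying init() and dispose() together. The createPythonBridge helper used here is hypothetical; substitute whatever mechanism your setup provides for obtaining a PythonBridge.

```ts
import { createPythonBridge, HashingVectorizer } from 'sklearn' // assumed path; createPythonBridge is hypothetical

const py = await createPythonBridge()
const vectorizer = new HashingVectorizer({ n_features: 2 ** 18 })

await vectorizer.init(py) // the instance is unusable until this resolves
try {
  // HashingVectorizer is stateless, so transform() works without a prior fit.
  const X = await vectorizer.transform({ X: ['some text to hash'] })
} finally {
  await vectorizer.dispose() // release the underlying Python resources
}
```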
partial_fit()
partial_fit(opts): Promise<any>
Only validates the estimator's parameters.
This method exists to (i) validate the estimator's parameters and (ii) remain consistent with the scikit-learn transformer API.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.X? | any | Training data. |
| opts.y? | any | Not used, present for API consistency by convention. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:487
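A streaming sketch, assuming an initialized vectorizer as above. Because the hashing trick computes no state during fitting, partial_fit() only validates parameters; each incoming batch can then be hashed independently with transform() and fed to an incremental learner.

```ts
const batches: string[][] = [
  ['first batch of documents', 'more text in the first batch'],
  ['second batch of documents', 'still more text'],
]

for (const batch of batches) {
  await vectorizer.partial_fit({ X: batch }) // validates parameters only; nothing is learned
  const Xb = await vectorizer.transform({ X: batch })
  // ...pass Xb to an incremental estimator, e.g. SGDClassifier's partial_fit
}
```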
set_output()
set_output(opts): Promise<any>
Set output container.
See Introducing the set_output API for an example of how to use the API.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.transform? | "default" \| "pandas" \| "polars" | Configure output of transform and fit_transform. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:528
transform()
transform(opts): Promise<any[]>
Transform a sequence of documents to a document-term matrix.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.X? | any | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |
Returns Promise<any[]>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:562