Class: HashingVectorizer
Convert a collection of text documents to a matrix of token occurrences.
It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected onto the Euclidean unit sphere if norm='l2'.
This text vectorizer implementation uses the hashing trick to map token strings to feature integer indices.
This strategy has several advantages:
- it is very low memory and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory;
- it is fast to pickle and un-pickle, as it holds no state besides the constructor parameters;
- it can be used in a streaming (partial fit) or parallel pipeline, as there is no state computed during fit.
Constructors
new HashingVectorizer()
new HashingVectorizer(opts?): HashingVectorizer
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts? | object | - |
| opts.alternate_sign? | boolean | When true, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection. |
| opts.analyzer? | "word" \| "char" \| "char_wb" | Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input. |
| opts.binary? | boolean | If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. |
| opts.decode_error? | "strict" \| "ignore" \| "replace" | Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'. |
| opts.dtype? | any | Type of the matrix returned by fit_transform() or transform(). |
| opts.encoding? | string | If bytes or files are given to analyze, this encoding is used to decode. |
| opts.input? | "filename" \| "file" \| "content" | If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. If 'file', the sequence items must have a read method (file-like object) that is called to fetch the bytes in memory. If 'content', the input is expected to be a sequence of items of type string or bytes. |
| opts.lowercase? | boolean | Convert all characters to lowercase before tokenizing. |
| opts.n_features? | number | The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners. |
| opts.ngram_range? | any | The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable. |
| opts.norm? | "l1" \| "l2" | Norm used to normalize term vectors. undefined for no normalization. |
| opts.preprocessor? | any | Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable. |
| opts.stop_words? | any[] \| "english" | If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see Using stop words). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. |
| opts.strip_accents? | "ascii" \| "unicode" | Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any character. undefined (default) means no character normalization is performed. Both 'ascii' and 'unicode' use NFKD normalization from unicodedata.normalize. |
| opts.token_pattern? | string | Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). If there is a capturing group in token_pattern, then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted. |
| opts.tokenizer? | any | Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'. |
Returns HashingVectorizer
Defined in generated/feature_extraction/text/HashingVectorizer.ts:27
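A minimal construction sketch using a few of the options documented above. The 'sklearn' import path is an assumption and may differ in your environment; no Python work happens until init() is called on the instance.

```ts
import { HashingVectorizer } from 'sklearn' // assumed import path

const vectorizer = new HashingVectorizer({
  n_features: 2 ** 18,   // fewer columns than scikit-learn's default of 2**20, at the cost of more collisions
  ngram_range: [1, 2],   // unigrams and bigrams
  norm: 'l2',            // project each row onto the Euclidean unit sphere
  alternate_sign: true,  // approximately conserve inner products in the hashed space
  lowercase: true,
})
```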
Properties
| Property | Type | Default value | Defined in |
|---|---|---|---|
| _isDisposed | boolean | false | generated/feature_extraction/text/HashingVectorizer.ts:25 |
| _isInitialized | boolean | false | generated/feature_extraction/text/HashingVectorizer.ts:24 |
| _py | PythonBridge | undefined | generated/feature_extraction/text/HashingVectorizer.ts:23 |
| id | string | undefined | generated/feature_extraction/text/HashingVectorizer.ts:20 |
| opts | any | undefined | generated/feature_extraction/text/HashingVectorizer.ts:21 |
Accessors
py
Get Signature
get py(): PythonBridge
Returns PythonBridge
Set Signature
set py(pythonBridge): void
Parameters
| Parameter | Type |
|---|---|
| pythonBridge | PythonBridge |
Returns void
Defined in generated/feature_extraction/text/HashingVectorizer.ts:136
Methods
build_analyzer()
build_analyzer(opts): Promise<any>
Return a callable to process input data.
The callable handles preprocessing, tokenization, and n-grams generation.
Parameters
| Parameter | Type |
|---|---|
| opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:209
build_preprocessor()
build_preprocessor(opts): Promise<any>
Return a function to preprocess the text before tokenization.
Parameters
| Parameter | Type |
|---|---|
| opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:239
build_tokenizer()
build_tokenizer(opts): Promise<any>
Return a function that splits a string into a sequence of tokens.
Parameters
| Parameter | Type |
|---|---|
| opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:269
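A hedged sketch of the three builders, assuming a vectorizer that has already been initialized with init() (see below). Each call resolves to a handle to the corresponding Python callable built by scikit-learn.

```ts
// Assumes `vectorizer` is an initialized HashingVectorizer instance.
const analyzer = await vectorizer.build_analyzer({})         // preprocessing + tokenization + n-gram generation
const preprocessor = await vectorizer.build_preprocessor({}) // string transformation only
const tokenizer = await vectorizer.build_tokenizer({})       // splits a string into tokens
```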
decode()
decode(opts): Promise<any>
Decode the input into a string of unicode symbols.
The decoding strategy depends on the vectorizer parameters.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.doc? | string | The string to decode. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:301
dispose()
dispose(): Promise<void>
Disposes of the underlying Python resources.
Once dispose() is called, the instance is no longer usable.
Returns Promise<void>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:190
fit()
fit(opts): Promise<any>
Only validates the estimator's parameters.
This method exists to (i) validate the estimator's parameters and (ii) remain consistent with the scikit-learn transformer API.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.X? | any | Training data. |
| opts.y? | any | Not used, present for API consistency by convention. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:337
fit_transform()
fit_transform(opts): Promise<any[]>
Transform a sequence of documents to a document-term matrix.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.X? | any | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |
| opts.y? | any | Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline. |
Returns Promise<any[]>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:376
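A short sketch of hashing a small corpus, assuming a vectorizer constructed and initialized as in the earlier examples. Because the vectorizer is stateless, fit_transform() does not learn a vocabulary; it validates parameters and hashes each document into a row with n_features columns.

```ts
const docs = [
  'This is the first document.',
  'This document is the second document.',
  'And this is the third one.',
]

// Resolves to a sparse document-term matrix with one row per document.
const X = await vectorizer.fit_transform({ X: docs })
```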
get_metadata_routing()
get_metadata_routing(opts): Promise<any>
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.routing? | any | A MetadataRequest encapsulating routing information. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:419
get_stop_words()
get_stop_words(opts): Promise<any>
Build or fetch the effective stop words list.
Parameters
| Parameter | Type |
|---|---|
| opts | object |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:455
init()
init(py): Promise<void>
Initializes the underlying Python resources.
This instance is not usable until the Promise returned by init() resolves.
Parameters
| Parameter | Type |
|---|---|
| py | PythonBridge |
Returns Promise<void>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:149
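A lifecycle sketch tying init() and dispose() together. The createPythonBridge helper used here is hypothetical; substitute whatever mechanism your setup provides for obtaining a PythonBridge.

```ts
import { createPythonBridge, HashingVectorizer } from 'sklearn' // assumed path; createPythonBridge is hypothetical

const py = await createPythonBridge()
const vectorizer = new HashingVectorizer({ n_features: 2 ** 18 })

await vectorizer.init(py) // the instance is unusable until this resolves
try {
  // HashingVectorizer is stateless, so transform() works without a prior fit.
  const X = await vectorizer.transform({ X: ['some text to hash'] })
} finally {
  await vectorizer.dispose() // release the underlying Python resources
}
```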
partial_fit()
partial_fit(opts): Promise<any>
Only validates the estimator's parameters.
This method exists to (i) validate the estimator's parameters and (ii) remain consistent with the scikit-learn transformer API.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.X? | any | Training data. |
| opts.y? | any | Not used, present for API consistency by convention. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:487
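A streaming sketch, assuming an initialized vectorizer as above. Because the hashing trick computes no state during fitting, partial_fit() only validates parameters; each incoming batch can then be hashed independently with transform() and fed to an incremental learner.

```ts
const batches: string[][] = [
  ['first batch of documents', 'more text in the first batch'],
  ['second batch of documents', 'still more text'],
]

for (const batch of batches) {
  await vectorizer.partial_fit({ X: batch }) // validates parameters only; nothing is learned
  const Xb = await vectorizer.transform({ X: batch })
  // ...pass Xb to an incremental estimator, e.g. SGDClassifier's partial_fit
}
```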
set_output()
set_output(opts): Promise<any>
Set output container.
See Introducing the set_output API for an example of how to use the API.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.transform? | "default" \| "pandas" \| "polars" | Configure output of transform and fit_transform. |
Returns Promise<any>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:528
transform()
transform(opts): Promise<any[]>
Transform a sequence of documents to a document-term matrix.
Parameters
| Parameter | Type | Description |
|---|---|---|
| opts | object | - |
| opts.X? | any | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |
Returns Promise<any[]>
Defined in generated/feature_extraction/text/HashingVectorizer.ts:562