
Class: HashingVectorizer

Convert a collection of text documents to a matrix of token occurrences.

It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected onto the Euclidean unit sphere if norm='l2'.

This text vectorizer implementation uses the hashing trick to map token strings to feature integer indices, instead of building an in-memory vocabulary.

This strategy has several advantages: it is very low memory and scales to large datasets, since no vocabulary dictionary needs to be held in memory; it is fast to pickle and un-pickle, since it holds no state besides its constructor parameters; and it can be used in a streaming (partial fit) or parallel pipeline, since no state is computed during fit. The main trade-off is that the mapping is one-way: there is no inverse transform, so the model cannot report which tokens map to which features.
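For orientation, here is a minimal end-to-end sketch of driving the wrapper from TypeScript. The `'sklearn'` import path and the `createPythonBridge()` helper are assumptions about the surrounding package, not part of this class's documented API.

```ts
import { createPythonBridge, HashingVectorizer } from 'sklearn' // assumed entry point

async function main(): Promise<void> {
  // Assumed helper that starts the Python process behind PythonBridge.
  const py = await createPythonBridge()

  // 2**10 columns keeps the example small; scikit-learn's default is 2**20.
  const vectorizer = new HashingVectorizer({ n_features: 1024, norm: 'l2' })
  await vectorizer.init(py)

  // The hashing trick is stateless, so transform() alone is enough;
  // fit() would only validate the parameters.
  const X = await vectorizer.transform({
    X: ['the quick brown fox', 'jumped over the lazy dog'],
  })
  console.log(X) // sparse document-term matrix with 1024 columns

  await vectorizer.dispose()
}

main()
```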

Python Reference

Constructors

new HashingVectorizer()

new HashingVectorizer(opts?): HashingVectorizer

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `opts?` | `object` | - |
| `opts.alternate_sign?` | `boolean` | When true, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small `n_features`. This approach is similar to sparse random projection. |
| `opts.analyzer?` | `"word" \| "char" \| "char_wb"` | Whether the feature should be made of word or character n-grams. Option `'char_wb'` creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input. |
| `opts.binary?` | `boolean` | If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. |
| `opts.decode_error?` | `"strict" \| "ignore" \| "replace"` | Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default it is `'strict'`, meaning that a `UnicodeDecodeError` will be raised. Other values are `'ignore'` and `'replace'`. |
| `opts.dtype?` | `any` | Type of the matrix returned by `fit_transform()` or `transform()`. |
| `opts.encoding?` | `string` | If bytes or files are given to analyze, this encoding is used to decode. |
| `opts.input?` | `"filename" \| "file" \| "content"` | If `'filename'`, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. |
| `opts.lowercase?` | `boolean` | Convert all characters to lowercase before tokenizing. |
| `opts.n_features?` | `number` | The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners. |
| `opts.ngram_range?` | `any` | The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an `ngram_range` of `(1, 1)` means only unigrams, `(1, 2)` means unigrams and bigrams, and `(2, 2)` means only bigrams. Only applies if `analyzer` is not callable. |
| `opts.norm?` | `"l1" \| "l2"` | Norm used to normalize term vectors. `undefined` for no normalization. |
| `opts.preprocessor?` | `any` | Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if `analyzer` is not callable. |
| `opts.stop_words?` | `any[] \| "english"` | If `'english'`, a built-in stop word list for English is used. There are several known issues with `'english'` and you should consider an alternative (see Using stop words). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if `analyzer == 'word'`. |
| `opts.strip_accents?` | `"ascii" \| "unicode"` | Remove accents and perform other character normalization during the preprocessing step. `'ascii'` is a fast method that only works on characters that have a direct ASCII mapping. `'unicode'` is a slightly slower method that works on any character. `undefined` (default) means no character normalization is performed. Both `'ascii'` and `'unicode'` use NFKD normalization from `unicodedata.normalize`. |
| `opts.token_pattern?` | `string` | Regular expression denoting what constitutes a "token", only used if `analyzer == 'word'`. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). If there is a capturing group in `token_pattern`, then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted. |
| `opts.tokenizer?` | `any` | Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if `analyzer == 'word'`. |

Returns HashingVectorizer

Defined in generated/feature_extraction/text/HashingVectorizer.ts:27
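As a sketch of how the documented options compose (the values are illustrative, and the `'sklearn'` import path is an assumption):

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

// Character n-grams within word boundaries, hashed into 2**18 columns,
// with binary occurrence flags instead of counts and no normalization.
const vectorizer = new HashingVectorizer({
  analyzer: 'char_wb',
  ngram_range: [2, 4], // JS array standing in for Python's (2, 4) tuple
  n_features: 262144,
  binary: true,
  norm: undefined, // no normalization (maps to Python None)
})
```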

Properties

| Property | Type | Default value | Defined in |
| --- | --- | --- | --- |
| `_isDisposed` | `boolean` | `false` | generated/feature_extraction/text/HashingVectorizer.ts:25 |
| `_isInitialized` | `boolean` | `false` | generated/feature_extraction/text/HashingVectorizer.ts:24 |
| `_py` | `PythonBridge` | `undefined` | generated/feature_extraction/text/HashingVectorizer.ts:23 |
| `id` | `string` | `undefined` | generated/feature_extraction/text/HashingVectorizer.ts:20 |
| `opts` | `any` | `undefined` | generated/feature_extraction/text/HashingVectorizer.ts:21 |

Accessors

py

Get Signature

get py(): PythonBridge

Returns PythonBridge

Set Signature

set py(pythonBridge): void

Parameters

| Parameter | Type |
| --- | --- |
| `pythonBridge` | `PythonBridge` |

Returns void

Defined in generated/feature_extraction/text/HashingVectorizer.ts:136

Methods

build_analyzer()

build_analyzer(opts): Promise<any>

Return a callable to process input data.

The callable handles preprocessing, tokenization, and n-grams generation.

Parameters

| Parameter | Type |
| --- | --- |
| `opts` | `object` |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:209


build_preprocessor()

build_preprocessor(opts): Promise<any>

Return a function to preprocess the text before tokenization.

Parameters

| Parameter | Type |
| --- | --- |
| `opts` | `object` |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:239


build_tokenizer()

build_tokenizer(opts): Promise<any>

Return a function that splits a string into a sequence of tokens.

Parameters

| Parameter | Type |
| --- | --- |
| `opts` | `object` |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:269
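Taken together, the three `build_*` methods expose the stages of the analysis pipeline individually. A hedged sketch, assuming an already-initialized instance; the returned values are handles to Python callables, so what can be done with them from JavaScript depends on the bridge:

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

async function inspectPipeline(vec: HashingVectorizer): Promise<void> {
  const preprocessor = await vec.build_preprocessor({}) // e.g. lowercasing, accent stripping
  const tokenizer = await vec.build_tokenizer({})       // splits a string into tokens
  const analyzer = await vec.build_analyzer({})         // preprocessing + tokenization + n-grams
  console.log({ preprocessor, tokenizer, analyzer })
}
```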


decode()

decode(opts): Promise<any>

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `opts` | `object` | - |
| `opts.doc?` | `string` | The string to decode. |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:301
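A small sketch of decode(), assuming an initialized instance (import path assumed as before):

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

// Decode a raw document using the vectorizer's `encoding` and `decode_error` settings.
async function decodeDoc(vec: HashingVectorizer, raw: string): Promise<any> {
  return await vec.decode({ doc: raw })
}
```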


dispose()

dispose(): Promise<void>

Disposes of the underlying Python resources.

Once dispose() is called, the instance is no longer usable.

Returns Promise<void>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:190


fit()

fit(opts): Promise<any>

Only validates the estimator's parameters.

This method exists to (i) validate the estimator's parameters and (ii) stay consistent with the scikit-learn transformer API.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `opts` | `object` | - |
| `opts.X?` | `any` | Training data. |
| `opts.y?` | `any` | Not used, present for API consistency by convention. |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:337
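Since nothing is learned, a fit() call is useful only for parameter validation and pipeline compatibility. A sketch, assuming an initialized instance:

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

// fit() builds no vocabulary and stores no state; it only checks the
// constructor parameters and keeps the transformer API consistent.
async function validateParams(vec: HashingVectorizer, docs: string[]): Promise<void> {
  await vec.fit({ X: docs })
}
```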


fit_transform()

fit_transform(opts): Promise<any[]>

Transform a sequence of documents to a document-term matrix.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `opts` | `object` | - |
| `opts.X?` | `any` | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |
| `opts.y?` | `any` | Ignored. This parameter exists only for compatibility with `sklearn.pipeline.Pipeline`. |

Returns Promise<any[]>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:376
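A sketch of vectorizing a batch in one call (initialized instance assumed). For this estimator, fit_transform() is effectively the same as transform(), since nothing is fitted:

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

async function vectorize(vec: HashingVectorizer, docs: string[]): Promise<any[]> {
  // Documents are hashed directly into the fixed-width feature space.
  return await vec.fit_transform({ X: docs })
}
```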


get_metadata_routing()

get_metadata_routing(opts): Promise<any>

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `opts` | `object` | - |
| `opts.routing?` | `any` | A `MetadataRequest` encapsulating routing information. |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:419


get_stop_words()

get_stop_words(opts): Promise<any>

Build or fetch the effective stop words list.

Parameters

| Parameter | Type |
| --- | --- |
| `opts` | `object` |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:455
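A sketch for inspecting the stop word list that will actually be applied, e.g. on an instance constructed with `{ stop_words: 'english' }` (initialized instance assumed):

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

async function effectiveStopWords(vec: HashingVectorizer): Promise<any> {
  // Returns the resolved stop word list derived from the constructor options.
  return await vec.get_stop_words({})
}
```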


init()

init(py): Promise<void>

Initializes the underlying Python resources.

This instance is not usable until the Promise returned by init() resolves.

Parameters

| Parameter | Type |
| --- | --- |
| `py` | `PythonBridge` |

Returns Promise<void>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:149


partial_fit()

partial_fit(opts): Promise<any>

Only validates the estimator's parameters.

This method exists to (i) validate the estimator's parameters and (ii) stay consistent with the scikit-learn transformer API.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `opts` | `object` | - |
| `opts.X?` | `any` | Training data. |
| `opts.y?` | `any` | Not used, present for API consistency by convention. |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:487
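Because the vectorizer keeps no fitted state, it slots naturally into out-of-core workflows: call partial_fit() (a parameter-validation no-op) and transform() batch by batch. A sketch, assuming an initialized instance and an arbitrary async batch source:

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

async function streamVectorize(
  vec: HashingVectorizer,
  batches: AsyncIterable<string[]> // hypothetical source of document chunks
): Promise<any[][]> {
  const results: any[][] = []
  for await (const docs of batches) {
    await vec.partial_fit({ X: docs }) // validates parameters only; no state is kept
    results.push(await vec.transform({ X: docs }))
  }
  return results
}
```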


set_output()

set_output(opts): Promise<any>

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `opts` | `object` | - |
| `opts.transform?` | `"default" \| "pandas" \| "polars"` | Configure output of `transform` and `fit_transform`. |

Returns Promise<any>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:528
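A sketch of requesting a different output container (initialized instance assumed). Note that scikit-learn's pandas/polars output generally rejects sparse data, so whether this is useful with a hashing vectorizer depends on your scikit-learn version and settings:

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

async function usePandasOutput(vec: HashingVectorizer): Promise<void> {
  // Ask transform()/fit_transform() to return a pandas container on the Python side.
  await vec.set_output({ transform: 'pandas' })
}
```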


transform()

transform(opts): Promise<any[]>

Transform a sequence of documents to a document-term matrix.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `opts` | `object` | - |
| `opts.X?` | `any` | Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed. |

Returns Promise<any[]>

Defined in generated/feature_extraction/text/HashingVectorizer.ts:562
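Finally, a sketch of hashing new documents into the same fixed-width feature space (initialized instance assumed); because the mapping is a pure hash function, no prior fit() call is required:

```ts
import { HashingVectorizer } from 'sklearn' // assumed entry point

async function vectorizeNewDocs(vec: HashingVectorizer, docs: string[]): Promise<any[]> {
  // With the default input ('content') the document texts are passed directly;
  // had the instance been constructed with { input: 'filename' }, X would be file paths.
  return await vec.transform({ X: docs })
}
```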