pyabsa.utils.text_utils.word2vec

Module Contents

Functions

train_word2vec([corpus_files, save_path, vector_dim, ...])

Train a Word2Vec model on a given corpus and save the resulting model and vectors to disk.

Attributes

tokenizer

pyabsa.utils.text_utils.word2vec.train_word2vec(corpus_files: List[str] = None, save_path: str = 'word2vec', vector_dim: int = 300, window: int = 5, min_count: int = 1000, skip_gram: int = 1, num_workers: int = None, epochs: int = 10, pre_tokenizer: str = None, **kwargs)

Train a Word2Vec model on a given corpus and save the resulting model and vectors to disk.

Args:
- corpus_files: a list of file paths for the input corpus
- save_path: the directory where the model and vectors will be saved
- vector_dim: the dimension of the resulting word vectors
- window: the size of the context window
- min_count: the minimum number of occurrences for a word to be included in the model
- skip_gram: whether to use the skip-gram (1) or CBOW (0) algorithm
- num_workers: the number of worker threads to use (default: CPU count - 1)
- epochs: the number of iterations over the corpus
- pre_tokenizer: the name of a tokenizer to use for preprocessing (optional)
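
A minimal usage sketch based on the signature above; the corpus file path is a placeholder, and the keyword values simply restate the documented defaults:

    from pyabsa.utils.text_utils.word2vec import train_word2vec

    # "corpus.txt" is a hypothetical plain-text corpus file; substitute
    # your own file paths.
    train_word2vec(
        corpus_files=["corpus.txt"],  # input corpus file(s)
        save_path="word2vec",         # directory for the saved model and vectors
        vector_dim=300,               # dimensionality of the word vectors
        window=5,                     # context window size
        min_count=1000,               # ignore words with fewer occurrences
        skip_gram=1,                  # 1 = skip-gram, 0 = CBOW
        epochs=10,                    # passes over the corpus
    )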

pyabsa.utils.text_utils.word2vec.tokenizer