pyrdf2vec.embedders.fasttext module

class pyrdf2vec.embedders.fasttext.FastText(**kwargs)

Bases: pyrdf2vec.embedders.embedder.Embedder

Defines the FastText embedding technique.

SEE: https://radimrehurek.com/gensim/models/fasttext.html

The RDF2Vec implementation of FastText does not consider the min_n and max_n parameters for n-gram splitting.

This RDF2Vec implementation computes n-grams for walks only by splitting the URIs of subjects and predicates on the "#" symbol. Since objects are encoded in MD5, splitting them into n-grams does not make sense.

You will likely want to provide another split strategy for computing the n-grams of the entities. If so, provide your own compute_ngrams_bytes function to FastText.
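To illustrate, here is a minimal sketch of this default splitting strategy (simplified and hypothetical, not the library's exact code; the example URI is illustrative):

    from typing import List

    def default_split_sketch(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]:
        # Sketch of the strategy described above: URIs are split on "#" and
        # each fragment becomes one "n-gram"; minn/maxn are ignored, matching
        # the note about min_n and max_n at the top of this page.
        if "#" in entity:
            return [fragment.encode() for fragment in entity.split("#")]
        # MD5-encoded objects are kept whole, since splitting them is meaningless.
        return [entity.encode()]

    print(default_split_sketch("http://example.org/ontology#isPartOf"))
    # [b'http://example.org/ontology', b'isPartOf']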

_model

The gensim.models.fasttext model. Defaults to None.

kwargs

The keyword arguments dictionary. Defaults to {bucket=2000000, min_count=0, max_n=0, min_n=0, negative=20, vector_size=500}.

func_computing_ngrams

The function to call for computing the n-grams. If you reimplement it, it is important to respect the signature imposed by gensim: func(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]. Defaults to compute_ngrams_bytes.
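For example, a custom strategy that derives fragments from "/"-separated path segments instead might look like the following sketch (hedged: split_on_slash is hypothetical, and it is assumed that func_computing_ngrams can be passed as a keyword argument and is forwarded to the keyed vectors):

    from typing import List

    from pyrdf2vec.embedders.fasttext import FastText

    def split_on_slash(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]:
        # Hypothetical strategy: use the URI's path segments as "n-grams".
        return [segment.encode() for segment in entity.split("/") if segment]

    # Assumption: func_computing_ngrams is forwarded to the keyed vectors.
    embedder = FastText(func_computing_ngrams=split_on_slash)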

fit(walks, is_update=False)

Fits the FastText model based on the provided walks.

Parameters
  • walks (List[List[Tuple[str, ...]]]) – The walks used to create the corpus to fit the model.

  • is_update (bool) – True if the new corpus should be added to the old model’s walks, False otherwise. Defaults to False.

Return type

Embedder

Returns

The fitted FastText model.
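A minimal usage sketch (the walks, URIs, and hashed object are illustrative):

    from pyrdf2vec.embedders.fasttext import FastText

    # Illustrative walks: a list of walks per entity, each walk being a
    # tuple of strings; objects appear as MD5 hashes (see the note above).
    walks = [
        [
            (
                "http://example.org/ontology#Alice",
                "http://example.org/ontology#knows",
                "9f9d51bc70ef21ca5c14f307980a29d8",
            ),
        ]
    ]

    embedder = FastText()
    embedder = embedder.fit(walks)  # returns the fitted embedder itself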

transform(entities)

Constructs the features vectors of the provided entities.

Parameters
  • entities (List[str]) – The entities, including test entities, for which to create the embeddings. Since RDF2Vec is unsupervised, there is no label leakage.

Return type

List[str]

Returns

The features vector of the provided entities.
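Continuing the sketch above, embeddings can then be retrieved for a list of entities (hedged; the URI is illustrative):

    entities = ["http://example.org/ontology#Alice"]
    embeddings = embedder.transform(entities)  # one vector per entity
    print(len(embeddings))  # 1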

class pyrdf2vec.embedders.fasttext.RDFFastTextKeyedVectors(bucket=2000000, vector_size=500, *, func_computing_ngrams=None)

Bases: gensim.models.fasttext.FastTextKeyedVectors

bucket: int

compute_ngrams_bytes(entity, minn=0, maxn=0)

Reimplementation of the compute_ngrams_bytes method of gensim. This override is needed to call our own n-gram computation function.

Parameters
  • entity (str) – The entity for which to compute the n-grams.

  • minn (int) – Minimum length of char n-grams to be used for training entity representations. Defaults to 0.

  • maxn (int) – Maximum length of char n-grams to be used for training entity representations. Defaults to 0.

Return type

List[bytes]

Returns

The n-gram bytes.
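For instance, assuming the default "#"-splitting strategy described at the top of this page, a sketch of calling it directly:

    from pyrdf2vec.embedders.fasttext import RDFFastTextKeyedVectors

    kv = RDFFastTextKeyedVectors(bucket=2000000, vector_size=500)
    ngrams = kv.compute_ngrams_bytes("http://example.org/ontology#isPartOf")
    # Expected with the default strategy:
    # [b'http://example.org/ontology', b'isPartOf']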

ft_hash_bytes(bytez)

Computes the hash of bytez.

Parameters

bytez (bytes) – The bytes to hash.

Return type

int

Returns

The hash of the bytes.
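A small sketch, reusing the kv instance from the previous example:

    h = kv.ft_hash_bytes(b"isPartOf")  # deterministic integer hash of the bytes
    index = h % kv.bucket              # how a hash maps into the bucket space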

ft_ngram_hashes(entity, minn=0, maxn=0, num_buckets=2000000)

Reimplementation of the ft_ngram_hashes method of gensim. This override is needed to call our compute_ngrams_bytes method.

Parameters
  • entity (str) – The entity whose n-grams are hashed.

  • minn (int) – Minimum length of char n-grams to be used for training entity representations. Defaults to 0.

  • maxn (int) – Maximum length of char n-grams to be used for training entity representations. Defaults to 0.

  • num_buckets (int) – Character n-grams are hashed into a fixed number of buckets, in order to limit the memory usage of the model. Defaults to 2000000.

Return type

List[Any]

Returns

The n-gram hashes.
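This method ties compute_ngrams_bytes and ft_hash_bytes together: each n-gram of the entity is hashed, then reduced modulo num_buckets. A sketch, again reusing kv from above:

    hashes = kv.ft_ngram_hashes(
        "http://example.org/ontology#isPartOf", num_buckets=2000000
    )
    # Each hash indexes a row in the model's ngram-vectors matrix.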

get_vector(word, norm=False)

Get word representations in vector space, as a 1D numpy array.

Parameters
  • word (str) – Input word.

  • norm (bool, optional) – If True, resulting vector will be L2-normalized (unit Euclidean length).

Return type

numpy.ndarray

Returns

Vector representation of word.

Raises

KeyError – If the word and all its n-grams are not in the vocabulary.
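For example (hedged; this assumes a model fitted as in the earlier sketches, whose wv attribute is an RDFFastTextKeyedVectors instance, and the default vector_size of 500):

    vector = embedder._model.wv.get_vector(
        "http://example.org/ontology#Alice", norm=True
    )
    print(vector.shape)  # (500,) with the default vector_size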

recalc_char_ngram_buckets()

Reimplementation of the recalc_char_ngram_buckets method of gensim. This override is needed to call our ft_ngram_hashes method.

Return type

None

vector_size: int