pyrdf2vec.embedders.fasttext module

class pyrdf2vec.embedders.fasttext.FastText(**kwargs)

Bases: pyrdf2vec.embedders.embedder.Embedder

Defines the FastText embedding technique.

SEE: https://radimrehurek.com/gensim/models/fasttext.html

The RDF2Vec implementation of FastText does not consider the min_n and max_n parameters for n-gram splitting.

This RDF2Vec implementation computes n-grams for walks only by splitting the URIs of subjects and predicates on the "#" symbol. Since objects are encoded in MD5, splitting them into n-grams does not make sense.

You will likely want to provide another split strategy for computing the n-grams of the entities. If so, provide your own compute_ngrams_bytes function to FastText.
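To illustrate, here is a minimal sketch of this default splitting strategy (simplified and hypothetical, not the library's exact code; the example URI is illustrative):

    from typing import List

    def default_split_sketch(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]:
        # Sketch of the strategy described above: URIs are split on "#" and
        # each fragment becomes one "n-gram"; minn/maxn are ignored, matching
        # the note about min_n and max_n at the top of this page.
        if "#" in entity:
            return [fragment.encode() for fragment in entity.split("#")]
        # MD5-encoded objects are kept whole, since splitting them is meaningless.
        return [entity.encode()]

    print(default_split_sketch("http://example.org/ontology#isPartOf"))
    # [b'http://example.org/ontology', b'isPartOf']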

_model

The gensim.models.fasttext model. Defaults to None.

kwargs

The keyword arguments dictionary. Defaults to {bucket=2000000, min_count=0, max_n=0, min_n=0, negative=20, vector_size=500}.

func_computing_ngrams

The function to call for computing the n-grams. If you reimplement it, it is important to respect the signature imposed by gensim: func(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]. Defaults to compute_ngrams_bytes.
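For example, a custom strategy that derives fragments from "/"-separated path segments instead might look like the following sketch (hedged: split_on_slash is hypothetical, and it is assumed that func_computing_ngrams can be passed as a keyword argument and is forwarded to the keyed vectors):

    from typing import List

    from pyrdf2vec.embedders.fasttext import FastText

    def split_on_slash(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]:
        # Hypothetical strategy: use the URI's path segments as "n-grams".
        return [segment.encode() for segment in entity.split("/") if segment]

    # Assumption: func_computing_ngrams is forwarded to the keyed vectors.
    embedder = FastText(func_computing_ngrams=split_on_slash)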

fit(walks, is_update=False)

Fits the FastText model based on the provided walks.

Parameters
  • walks (List[List[Tuple[str, ...]]]) – The walks used to create the corpus to fit the model.

  • is_update (bool) – True if the new corpus should be added to the old model’s walks, False otherwise. Defaults to False.

Return type

Embedder

Returns

The fitted FastText model.
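A minimal usage sketch (the walks, URIs, and hashed object are illustrative):

    from pyrdf2vec.embedders.fasttext import FastText

    # Illustrative walks: a list of walks per entity, each walk being a
    # tuple of strings; objects appear as MD5 hashes (see the note above).
    walks = [
        [
            (
                "http://example.org/ontology#Alice",
                "http://example.org/ontology#knows",
                "9f9d51bc70ef21ca5c14f307980a29d8",
            ),
        ]
    ]

    embedder = FastText()
    embedder = embedder.fit(walks)  # returns the fitted embedder itself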

transform(entities)

Constructs the features vectors of the provided entities.

Parameters
  • entities (List[str]) – The entities, including test entities, for which to create the embeddings. Since RDF2Vec is unsupervised, there is no label leakage.

Return type

List[str]

Returns

The features vector of the provided entities.
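Continuing the sketch above, embeddings can then be retrieved for a list of entities (hedged; the URI is illustrative):

    entities = ["http://example.org/ontology#Alice"]
    embeddings = embedder.transform(entities)  # one vector per entity
    print(len(embeddings))  # 1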

class pyrdf2vec.embedders.fasttext.RDFFastTextKeyedVectors(bucket=2000000, vector_size=500, *, func_computing_ngrams=None)

Bases: gensim.models.fasttext.FastTextKeyedVectors

bucket: int

compute_ngrams_bytes(entity, minn=0, maxn=0)

Reimplementation of the compute_ngrams_bytes method of gensim. This override is needed to call our own n-gram computation function.

Parameters
  • entity (str) – The entity for which to compute the n-grams.

  • minn (int) – Minimum length of char n-grams to be used for training entity representations. Defaults to 0.

  • maxn (int) – Maximum length of char n-grams to be used for training entity representations. Defaults to 0.

Return type

List[bytes]

Returns

The n-gram bytes.
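For instance, assuming the default "#"-splitting strategy described at the top of this page, a sketch of calling it directly:

    from pyrdf2vec.embedders.fasttext import RDFFastTextKeyedVectors

    kv = RDFFastTextKeyedVectors(bucket=2000000, vector_size=500)
    ngrams = kv.compute_ngrams_bytes("http://example.org/ontology#isPartOf")
    # Expected with the default strategy:
    # [b'http://example.org/ontology', b'isPartOf']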

ft_hash_bytes(bytez)

Computes the hash of bytez.

Parameters

bytez (bytes) – The bytes to hash.

Return type

int

Returns

The hash of the bytes.
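A small sketch, reusing the kv instance from the previous example:

    h = kv.ft_hash_bytes(b"isPartOf")  # deterministic integer hash of the bytes
    index = h % kv.bucket              # how a hash maps into the bucket space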

ft_ngram_hashes(entity, minn=0, maxn=0, num_buckets=2000000)

Reimplementation of the ft_ngram_hashes method of gensim. This override is needed to call our compute_ngrams_bytes method.

Parameters
  • entity (str) – The entity whose n-grams are hashed.

  • minn (int) – Minimum length of char n-grams to be used for training entity representations. Defaults to 0.

  • maxn (int) – Maximum length of char n-grams to be used for training entity representations. Defaults to 0.

  • num_buckets (int) – Character n-grams are hashed into a fixed number of buckets, in order to limit the memory usage of the model. Defaults to 2000000.

Return type

List[Any]

Returns

The n-gram hashes.
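This method ties compute_ngrams_bytes and ft_hash_bytes together: each n-gram of the entity is hashed, then reduced modulo num_buckets. A sketch, again reusing kv from above:

    hashes = kv.ft_ngram_hashes(
        "http://example.org/ontology#isPartOf", num_buckets=2000000
    )
    # Each hash indexes a row in the model's ngram-vectors matrix.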

get_vector(word, norm=False)

Get word representations in vector space, as a 1D numpy array.

Parameters
  • word (str) – Input word.

  • norm (bool, optional) – If True, resulting vector will be L2-normalized (unit Euclidean length).

Return type

numpy.ndarray

Returns

Vector representation of word.

Raises

KeyError – If the word and all its n-grams are not in the vocabulary.
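For example (hedged; this assumes a model fitted as in the earlier sketches, whose wv attribute is an RDFFastTextKeyedVectors instance, and the default vector_size of 500):

    vector = embedder._model.wv.get_vector(
        "http://example.org/ontology#Alice", norm=True
    )
    print(vector.shape)  # (500,) with the default vector_size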

recalc_char_ngram_buckets()

Reimplementation of the recalc_char_ngram_buckets method of gensim. This override is needed to call our ft_ngram_hashes method.

Return type

None

vector_size: int