pyrdf2vec.embedders.fasttext module¶
- class pyrdf2vec.embedders.fasttext.FastText(**kwargs)¶
Bases:
pyrdf2vec.embedders.embedder.Embedder
Defines the FastText embedding technique.
SEE: https://radimrehurek.com/gensim/models/fasttext.html
The RDF2Vec implementation of FastText ignores the min_n and max_n parameters for n-gram splitting.
Instead, this implementation computes the n-grams of walks by splitting the URIs of subjects and predicates on their “#” symbol. Objects are encoded as MD5 hashes, so splitting them into n-grams is not meaningful.
If you want a different split strategy for computing the n-grams of entities, provide your own compute_ngrams_bytes function to FastText.
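As a sketch, a custom split strategy only needs to match the signature documented below for func_computing_ngrams (func(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]). The function name and the splitting rule here are hypothetical examples, not part of the library:

```python
from typing import List

def split_on_slash(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]:
    """Hypothetical split strategy: splits a URI on '/' instead of
    the default '#' splitting, dropping empty tokens."""
    return [token.encode("utf8") for token in entity.split("/") if token]
```

Assuming the constructor accepts it through the func_computing_ngrams attribute documented below, it would be passed as FastText(func_computing_ngrams=split_on_slash).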
- _model¶
The gensim.models.fasttext model. Defaults to None.
- kwargs¶
The keyword arguments dictionary. Defaults to { bucket=2000000, min_count=0, max_n=0, min_n=0, negative=20, vector_size=500 }.
- func_computing_ngrams¶
The function called to compute the n-grams. When reimplementing it, respect the signature imposed by gensim: func(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]. Defaults to compute_ngrams_bytes.
- fit(walks, is_update=False)¶
Fits the FastText model based on provided walks.
- class pyrdf2vec.embedders.fasttext.RDFFastTextKeyedVectors(bucket=2000000, vector_size=500, *, func_computing_ngrams=None)¶
Bases:
gensim.models.fasttext.FastTextKeyedVectors
- compute_ngrams_bytes(entity, minn=0, maxn=0)¶
- Reimplementation of the compute_ngrams_bytes method of gensim. This override is needed to compute the n-grams with the user-provided split strategy (func_computing_ngrams).
- Parameters
entity (str) – The entity to compute the n-grams for.
minn (int) – Minimum length of char n-grams. Defaults to 0.
maxn (int) – Maximum length of char n-grams. Defaults to 0.
- Return type
List[bytes]
- Returns
The ngrams bytes.
- ft_hash_bytes(bytez)¶
Computes hash based on bytez.
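For illustration, fastText hashes n-grams with the 32-bit FNV-1a function. The following is a minimal Python sketch of that hash, not gensim's own (optimized) implementation of ft_hash_bytes:

```python
def fnv1a_32(bytez: bytes) -> int:
    """32-bit FNV-1a hash: XOR each byte into the state, then
    multiply by the FNV prime, keeping 32 bits."""
    h = 2166136261  # FNV-1a 32-bit offset basis
    for b in bytez:
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF  # 16777619 is the FNV prime
    return h
```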
- ft_ngram_hashes(entity, minn=0, maxn=0, num_buckets=2000000)¶
Reimplementation of the ft_ngram_hashes method of gensim. This override is needed to call our compute_ngrams_bytes method.
- Parameters
entity (str) – The entity to hash the n-grams for.
minn (int) – Minimum length of char n-grams to be used for training entity representations. Defaults to 0.
maxn (int) – Maximum length of char n-grams to be used for training entity representations. Defaults to 0.
num_buckets (int) – Character n-grams are hashed into a fixed number of buckets, in order to limit the memory usage of the model. Defaults to 2000000.
- Returns
The ngrams hashes.
- Return type
List[int]
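Conceptually, ft_ngram_hashes hashes each n-gram and reduces it modulo num_buckets, so that an unbounded vocabulary of n-grams maps into a fixed number of buckets. A minimal sketch of that bucketing step, where fnv1a_32 is a stand-in for gensim's ft_hash_bytes:

```python
from typing import List

def ngram_bucket_indices(ngrams: List[bytes], num_buckets: int = 2000000) -> List[int]:
    """Maps n-gram byte strings to bucket indices: hash each n-gram,
    then take the result modulo the fixed number of buckets."""
    def fnv1a_32(bytez: bytes) -> int:
        # 32-bit FNV-1a, the hash family fastText uses for n-grams.
        h = 2166136261
        for b in bytez:
            h = ((h ^ b) * 16777619) & 0xFFFFFFFF
        return h
    return [fnv1a_32(ng) % num_buckets for ng in ngrams]
```

Collisions between different n-grams are accepted by design; bounding the bucket count is what keeps the model's memory usage fixed regardless of vocabulary size.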
- get_vector(word, norm=False)¶
Get word representations in vector space, as a 1D numpy array.
- recalc_char_ngram_buckets()¶
Reimplementation of the recalc_char_ngram_buckets method of gensim. This override is needed to call our ft_ngram_hashes method.
- Return type
None