pyrdf2vec.embedders package

Module contents

class pyrdf2vec.embedders.Embedder

Bases: object

Base class of the embedding techniques.

abstract fit(corpus, is_update=False)

Fits a model based on the provided corpus.

Parameters

  • corpus (List[List[Tuple[str, ...]]]) – The corpus to fit the model on.

  • is_update (bool) – True if the new corpus should be added to the old model’s corpus, False otherwise. Defaults to False.

Return type

Embedder

Returns

The fitted model, according to the chosen embedding technique.

Raises

NotImplementedError – If this method is called without having been implemented in a subclass.

abstract transform(entities)

Constructs the feature vectors of the provided entities.

Parameters

entities (List[str]) – The entities (including any test entities) for which to create embeddings. Since RDF2Vec is unsupervised, there is no label leakage.

Return type

List[str]

Returns

The feature vectors of the provided entities.

Raises

NotImplementedError – If this method is called without having been implemented in a subclass.
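
Because fit and transform are abstract, a concrete embedder must implement both. The following minimal sketch shows the expected shape of such a subclass; the CountEmbedder class and its counting logic are purely illustrative and not part of pyrdf2vec:

   from typing import List, Tuple

   from pyrdf2vec.embedders import Embedder


   class CountEmbedder(Embedder):
       """Hypothetical embedder: one count-based feature per entity."""

       def __init__(self):
           self._counts = {}

       def fit(self, corpus: List[List[Tuple[str, ...]]], is_update: bool = False) -> Embedder:
           if not is_update:
               self._counts = {}
           # Count how often each token appears across all walks.
           for entity_walks in corpus:
               for walk in entity_walks:
                   for token in walk:
                       self._counts[token] = self._counts.get(token, 0) + 1
           return self

       def transform(self, entities: List[str]) -> List[List[int]]:
           # A (trivial, one-dimensional) feature vector per entity.
           return [[self._counts.get(entity, 0)] for entity in entities]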

class pyrdf2vec.embedders.FastText(**kwargs)

Bases: pyrdf2vec.embedders.embedder.Embedder

Defines the FastText embedding technique.

SEE: https://radimrehurek.com/gensim/models/fasttext.html

The RDF2Vec implementation of FastText does not consider the min_n and max_n parameters for n-gram splitting.

Instead, this implementation computes n-grams for walks only by splitting the URIs of subjects and predicates on the “#” symbol. Since objects are MD5-encoded, splitting them into n-grams would not be meaningful.

If you want a different split strategy for computing the n-grams of entities, provide your own compute_ngrams_bytes function to FastText (see the sketch after the attribute list below).

_model

The gensim.models.fasttext model. Defaults to None.

kwargs

The keyword arguments dictionary. Defaults to { bucket=2000000, min_count=0, max_n=0, min_n=0, negative=20, vector_size=500 }.

func_computing_ngrams

The function to call for computing n-grams. A reimplementation must respect the signature imposed by gensim: func(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]. Defaults to compute_ngrams_bytes.
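
As a hedged sketch of such a reimplementation: the function below treats the last path segment of a URI as a single n-gram, splitting on “/” instead of “#”. Both the splitting rule and the assumption that func_computing_ngrams can be supplied through the constructor’s **kwargs are illustrative:

   from typing import List

   from pyrdf2vec.embedders import FastText


   def split_on_slash(entity: str, minn: int = 0, maxn: int = 0) -> List[bytes]:
       # Hypothetical strategy: keep only the last "/"-separated segment.
       return [entity.rsplit("/", maxsplit=1)[-1].encode("utf-8")]


   fasttext = FastText(func_computing_ngrams=split_on_slash)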

fit(walks, is_update=False)

Fits the FastText model based on provided walks.

Parameters
  • walks (List[List[Tuple[str, ...]]]) – The walks used to build the corpus that fits the model.

  • is_update (bool) – True if the new corpus should be added to the old model’s walks, False otherwise. Defaults to False.

Return type

Embedder

Returns

The fitted FastText model.

transform(entities)

Constructs the feature vectors of the provided entities.

Parameters

entities (List[str]) – The entities (including any test entities) for which to create embeddings. Since RDF2Vec is unsupervised, there is no label leakage.

Return type

List[str]

Returns

The feature vectors of the provided entities.
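
A hedged usage sketch; the toy walks below are placeholders for walks that a pyrdf2vec walker would normally produce:

   from pyrdf2vec.embedders import FastText

   # Hypothetical walks: one list of walks per entity, each walk a tuple of URIs.
   walks = [
       [("http://example.org/Alice#knows", "http://example.org/Bob#name")],
   ]

   embedder = FastText().fit(walks)
   embeddings = embedder.transform(["http://example.org/Alice#knows"])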

class pyrdf2vec.embedders.Word2Vec(**kwargs)

Bases: pyrdf2vec.embedders.embedder.Embedder

Defines the Word2Vec embedding technique.

SEE: https://radimrehurek.com/gensim/models/word2vec.html

_model

The gensim.models.word2vec model. Defaults to None.

kwargs

The keyword arguments dictionary. Defaults to { min_count=0 }.

fit(walks, is_update=False)

Fits the Word2Vec model based on provided walks.

Parameters
  • walks (List[List[Tuple[str, ...]]]) – The walks used to build the corpus that fits the model.

  • is_update (bool) – True if the new walks should be added to the old model’s walks, False otherwise. Defaults to False.

Return type

Embedder

Returns

The fitted Word2Vec model.
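
A hedged sketch of incremental fitting with is_update; both walk lists are placeholders for walker output:

   from pyrdf2vec.embedders import Word2Vec

   first_walks = [[("http://example.org/A", "http://example.org/B")]]
   new_walks = [[("http://example.org/C", "http://example.org/D")]]

   embedder = Word2Vec(vector_size=16)
   embedder.fit(first_walks)                # builds the initial model
   embedder.fit(new_walks, is_update=True)  # adds the new walks to the old model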

transform(entities)

Constructs the feature vectors of the provided entities.

Parameters

entities (List[str]) – The entities (including any test entities) for which to create embeddings. Since RDF2Vec is unsupervised, there is no label leakage.

Return type

List[str]

Returns

The feature vectors of the provided entities.
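
In practice, an embedder is usually passed to RDF2VecTransformer rather than called directly. A hedged end-to-end sketch, in which the SPARQL endpoint and entity are placeholders:

   from pyrdf2vec import RDF2VecTransformer
   from pyrdf2vec.embedders import Word2Vec
   from pyrdf2vec.graphs import KG
   from pyrdf2vec.walkers import RandomWalker

   transformer = RDF2VecTransformer(
       embedder=Word2Vec(vector_size=200),
       walkers=[RandomWalker(max_depth=4, max_walks=10)],
   )
   # Placeholder endpoint and entity.
   embeddings, literals = transformer.fit_transform(
       KG("https://dbpedia.org/sparql"),
       ["http://dbpedia.org/resource/Brussels"],
   )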