faknow.data.process

faknow.data.process.process

class faknow.data.process.process.DropEdge(td_drop_rate: float, bu_drop_rate: float)[source]

Bases: object

Randomly drop edges for BiGCN.

__init__(td_drop_rate: float, bu_drop_rate: float)[source]
Parameters:
  • td_drop_rate (float) – edge drop rate in the top-down direction

  • bu_drop_rate (float) – edge drop rate in the bottom-up direction
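
A minimal construction sketch. Only __init__ is documented here, so how an instance is applied to a propagation graph is omitted; the drop rates below are illustrative:

>>> from faknow.data.process.process import DropEdge
>>> # drop 20% of top-down and 20% of bottom-up edges (illustrative rates)
>>> drop_edge = DropEdge(td_drop_rate=0.2, bu_drop_rate=0.2)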

class faknow.data.process.process.ProcessorForVit(vit_name: str, image_mean: List[float], image_std: List[float])[source]

Bases: object

Image processor for ViT.

__init__(vit_name: str, image_mean: List[float], image_std: List[float])[source]
Parameters:
  • vit_name (str) – name of the pre-trained ViT model

  • image_mean (List[float]) – per-channel mean for image normalization

  • image_std (List[float]) – per-channel std for image normalization
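
A minimal construction sketch. The model name and the 0.5 mean/std values are illustrative assumptions (common ViT normalization defaults), not prescribed by this API:

>>> from faknow.data.process.process import ProcessorForVit
>>> processor = ProcessorForVit(
...     vit_name='google/vit-base-patch16-224',  # illustrative model name
...     image_mean=[0.5, 0.5, 0.5],  # assumed normalization values
...     image_std=[0.5, 0.5, 0.5])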

faknow.data.process.process.calculate_cos_matrix(matrix1: Tensor, matrix2: Tensor)[source]

Calculate the pairwise cosine similarity matrix between two matrices.

Parameters:
  • matrix1 (torch.Tensor) – The first matrix, shape=(n, d)

  • matrix2 (torch.Tensor) – The second matrix, shape=(m, d)

Returns:

The cosine similarity matrix, shape=(n, m)

Return type:

torch.Tensor
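
A minimal sketch with random inputs: each row of matrix1 is compared against each row of matrix2, so the result has shape (n, m):

>>> import torch
>>> from faknow.data.process.process import calculate_cos_matrix
>>> m1 = torch.randn(4, 8)  # n=4 samples, d=8 dimensions
>>> m2 = torch.randn(6, 8)  # m=6 samples, same dimension
>>> calculate_cos_matrix(m1, m2).shape
torch.Size([4, 6])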

faknow.data.process.process.lsh_data_selection(domain_embeddings: Tensor, labelling_budget=100, hash_dimension=10) → List[int][source]

Locality-sensitive hashing (LSH) selection for the training dataset.

Parameters:
  • domain_embeddings (torch.Tensor) – 2-D domain embedding tensor of samples.

  • labelling_budget (int) – Number of samples to select; must be smaller than the total number of samples. Default=100.

  • hash_dimension (int) – Dimension of random hash vector. Default=10.

Returns:

A list of selected sample indices.

Return type:

List[int]
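
A minimal sketch with random embeddings; the sizes are illustrative and the selected indices depend on the random hash vectors:

>>> import torch
>>> from faknow.data.process.process import lsh_data_selection
>>> embeddings = torch.randn(500, 64)  # 500 samples, 64-dim embeddings
>>> indices = lsh_data_selection(embeddings, labelling_budget=50)
>>> isinstance(indices, list)
True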

faknow.data.process.process.split_dataset(data_path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], image_features: List[str] | None = None, transform: Callable[[str], Any] | None = None, ratio: List[float] | None = None) → List[Subset[Any]][source]

Split TextDataset or MultiModalDataset with the given ratio. If image_features is None, a TextDataset is split; otherwise, a MultiModalDataset is split.

Parameters:
  • data_path (str) – path to the JSON data file

  • text_features (List[str]) – names of text features in the file

  • tokenize (Callable[[List[str]], Any]) – function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

  • image_features (List[str]) – names of image features in the file. Default=None.

  • transform (Callable[[str], Any]) – function to transform an image, which takes an image path and returns a tensor or a dict of tensors. Default=None.

  • ratio (List[float]) – split ratios of the subsets. If None, defaults to [0.7, 0.1, 0.2]. Default=None.

Returns:

a list of subsets of the dataset

Return type:

List[Subset[Any]]
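
A minimal sketch, assuming a JSON file 'news.json' with a 'text' field exists; the tokenize callable below is a stand-in, and unpacking into train/validation/test follows from the default ratio [0.7, 0.1, 0.2]:

>>> import torch
>>> from faknow.data.process.process import split_dataset
>>> def tokenize(texts):  # stand-in: a real tokenizer would return token ids
...     return torch.zeros(len(texts), 128, dtype=torch.long)
>>> train, val, test = split_dataset('news.json', text_features=['text'],
...                                  tokenize=tokenize)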

faknow.data.process.text_process

class faknow.data.process.text_process.TokenizerFromPreTrained(max_len: int, bert: str, text_preprocessing: Callable[[List[str]], List[str]] | None = None)[source]

Bases: object

Tokenizer for pre-trained models from transformers with fixed length, returning token ids and attention masks

__init__(max_len: int, bert: str, text_preprocessing: Callable[[List[str]], List[str]] | None = None)[source]
Parameters:
  • max_len (int) – maximum length of the input text

  • bert (str) – name of the pre-trained BERT model

  • text_preprocessing (Optional[Callable[[List[str]], List[str]]]) – text preprocessing function. Defaults to None.
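
A minimal construction sketch; 'bert-base-uncased' is an example model name, and only __init__ is documented here, so applying the instance to raw texts is not shown:

>>> from faknow.data.process.text_process import TokenizerFromPreTrained
>>> tokenizer = TokenizerFromPreTrained(max_len=128, bert='bert-base-uncased')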

faknow.data.process.text_process.chinese_tokenize(text: str, stop_words: List[str] | None = None) → List[str][source]

Tokenize Chinese text with jieba, using a regex to remove punctuation.

Parameters:
  • text (str) – text to be tokenized

  • stop_words (List[str]) – stop words to remove. Default=None.

Returns:

tokenized text

Return type:

List[str]
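
A minimal sketch; the sentence and stop word are illustrative, and the exact segmentation depends on jieba's dictionary, so no tokens are asserted:

>>> from faknow.data.process.text_process import chinese_tokenize
>>> tokens = chinese_tokenize('今天天气很好。', stop_words=['的'])
>>> isinstance(tokens, list)
True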

faknow.data.process.text_process.english_tokenize(text: str) → List[str][source]

Tokenize English text with nltk, using a regex to remove punctuation.

Parameters:

text (str) – text to be tokenized

Returns:

tokenized text

Return type:

List[str]
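
A minimal sketch; the exact tokens depend on nltk's tokenizer, so none are asserted here:

>>> from faknow.data.process.text_process import english_tokenize
>>> tokens = english_tokenize('Fake news spreads fast!')  # punctuation removed
>>> isinstance(tokens, list)
True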

faknow.data.process.text_process.read_stop_words(path: str) → List[str][source]

Read stop words from a file.

Parameters:

path (str) – The path to the file containing stop words.

Returns:

A list of stop words.

Return type:

List[str]
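
A minimal sketch, assuming a file 'stop_words.txt' exists (the one-word-per-line layout is an assumption); the result plugs into chinese_tokenize above:

>>> from faknow.data.process.text_process import read_stop_words, chinese_tokenize
>>> stop_words = read_stop_words('stop_words.txt')  # hypothetical path
>>> tokens = chinese_tokenize('今天天气很好。', stop_words=stop_words)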