faknow.data.process
faknow.data.process.process
- class faknow.data.process.process.DropEdge(td_drop_rate: float, bu_drop_rate: float)
Bases: object
Randomly drops edges for BiGCN.
- class faknow.data.process.process.ProcessorForVit(vit_name: str, image_mean: List[float], image_std: List[float])
Bases: object
Processor for ViT.
- faknow.data.process.process.calculate_cos_matrix(matrix1: Tensor, matrix2: Tensor)
Calculate the pairwise cosine similarity matrix between two matrices.
- Parameters:
matrix1 (torch.Tensor) – The first matrix, shape=(n, d)
matrix2 (torch.Tensor) – The second matrix, shape=(m, d)
- Returns:
The cosine similarity matrix, shape=(n, m)
- Return type:
torch.Tensor
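The shapes above can be illustrated with a minimal sketch of pairwise cosine similarity. This is not necessarily how calculate_cos_matrix is implemented; the helper name pairwise_cosine and the eps value are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def pairwise_cosine(matrix1: torch.Tensor, matrix2: torch.Tensor) -> torch.Tensor:
    """Return the (n, m) matrix of cosine similarities between rows.

    Hypothetical helper, not part of faknow.
    """
    # L2-normalise each row; eps guards against division by zero for zero rows.
    a = F.normalize(matrix1, p=2, dim=1, eps=1e-8)
    b = F.normalize(matrix2, p=2, dim=1, eps=1e-8)
    # Dot products of unit vectors are cosine similarities.
    return a @ b.T


x = torch.tensor([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])  # n=3, d=2
y = torch.tensor([[3.0, 0.0], [0.0, -1.0]])              # m=2, d=2
sim = pairwise_cosine(x, y)
print(tuple(sim.shape))  # (3, 2)
```

Entry (i, j) of the result compares row i of the first matrix with row j of the second, so both inputs must share the feature dimension d.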
- faknow.data.process.process.lsh_data_selection(domain_embeddings: Tensor, labelling_budget=100, hash_dimension=10) -> List[int]
Locality-sensitive hashing (LSH) selection for the training dataset.
- Parameters:
domain_embeddings (torch.Tensor) – 2-D domain embedding tensor of samples.
labelling_budget (int) – Number of samples to select; must be smaller than the total number of samples. Default=100.
hash_dimension (int) – Dimension of random hash vector. Default=10.
- Returns:
A list of selected sample indices.
- Return type:
List[int]
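As a rough illustration of the idea, the sketch below uses one common LSH scheme (random hyperplanes) and picks one sample per hash bucket until the budget is met. It is an assumption about the general technique, not a description of what lsh_data_selection actually does; the name lsh_select_sketch and the seed parameter are hypothetical.

```python
import random
from collections import defaultdict
from typing import List

import torch


def lsh_select_sketch(embeddings: torch.Tensor, budget: int = 100,
                      hash_dim: int = 10, seed: int = 0) -> List[int]:
    """Pick `budget` row indices, spreading picks across random-hyperplane buckets."""
    n, d = embeddings.shape
    if budget >= n:
        raise ValueError("budget must be smaller than the number of samples")
    gen = torch.Generator().manual_seed(seed)
    rng = random.Random(seed)
    selected: List[int] = []
    chosen = set()
    while len(selected) < budget:
        # Fresh random hyperplanes each round; the sign pattern is the hash code.
        planes = torch.randn(hash_dim, d, generator=gen)
        bits = (embeddings @ planes.T > 0).long()   # (n, hash_dim)
        weights = 2 ** torch.arange(hash_dim)       # bit pattern -> integer code
        codes = (bits * weights).sum(dim=1).tolist()
        buckets = defaultdict(list)
        for idx, code in enumerate(codes):
            if idx not in chosen:
                buckets[code].append(idx)
        # Take one random, not-yet-selected sample from every non-empty bucket.
        for members in buckets.values():
            if len(selected) == budget:
                break
            pick = rng.choice(members)
            chosen.add(pick)
            selected.append(pick)
    return selected


embeddings = torch.randn(50, 8)
picked = lsh_select_sketch(embeddings, budget=20, hash_dim=4)
print(len(picked), len(set(picked)))  # 20 20
```

Because nearby embeddings tend to share a hash code, drawing one sample per bucket spreads the labelling budget over different regions of the embedding space.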
- faknow.data.process.process.split_dataset(data_path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], image_features: List[str] | None = None, transform: Callable[[str], Any] | None = None, ratio: List[float] | None = None) -> List[Subset[Any]]
Split a TextDataset or MultiModalDataset with the given ratio. If image_features is None, a TextDataset is split; otherwise a MultiModalDataset is split.
- Parameters:
data_path (str) – path to json file
text_features (List[str]) – a list of names of text features in files
tokenize (Callable[[List[str]], Any]) – function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors
image_features (List[str]) – a list of names of image features in files. Default=None.
transform (Callable[[str], Any]) – function to transform image, which takes a path to image and returns a tensor or a dict of tensors. Default=None.
ratio (List[float]) – a list of ratios, one per subset. If None, defaults to [0.7, 0.1, 0.2]. Default=None.
- Returns:
a list of subsets of dataset
- Return type:
List[Subset[Any]]
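To show how a ratio list such as [0.7, 0.1, 0.2] maps to subset sizes, here is a small standard-library sketch that splits sample indices. The function split_indices, its seed argument, and the rounding rule are assumptions for illustration; split_dataset may round or shuffle differently.

```python
import random
from typing import List, Optional, Sequence


def split_indices(n_samples: int, ratio: Optional[Sequence[float]] = None,
                  seed: int = 0) -> List[List[int]]:
    """Shuffle indices 0..n_samples-1 and cut them according to `ratio`."""
    if ratio is None:
        ratio = [0.7, 0.1, 0.2]  # default stated in the parameter list above
    if abs(sum(ratio) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Cumulative boundaries; the last subset absorbs any rounding remainder.
    bounds, cumulative = [0], 0.0
    for r in ratio[:-1]:
        cumulative += r
        bounds.append(round(cumulative * n_samples))
    bounds.append(n_samples)
    return [indices[a:b] for a, b in zip(bounds, bounds[1:])]


train, val, test = split_indices(10)
print(len(train), len(val), len(test))  # 7 1 2
```

Each index list could then back a torch.utils.data.Subset of the full dataset.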
faknow.data.process.text_process
- class faknow.data.process.text_process.TokenizerFromPreTrained(max_len: int, bert: str, text_preprocessing: Callable[[List[str]], List[str]] | None = None)
Bases: object
Tokenizer for pre-trained models from transformers with a fixed output length; returns token_id and mask.
- __init__(max_len: int, bert: str, text_preprocessing: Callable[[List[str]], List[str]] | None = None)
- Parameters:
max_len (int) – max length of input text
bert (str) – name of the pre-trained BERT model
text_preprocessing (Optional[Callable[[List[str]], List[str]]]) – text preprocessing function. Defaults to None.
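The "fixed length" behaviour can be pictured with a plain-Python sketch that pads or truncates token id sequences to max_len and builds the matching mask. The helper pad_to_fixed_length, the pad_id value, and the naive right-truncation are assumptions; the class itself presumably delegates this work to a transformers tokenizer.

```python
from typing import Dict, List


def pad_to_fixed_length(batch_ids: List[List[int]], max_len: int,
                        pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad or truncate every sequence to max_len and build an attention mask."""
    token_ids, masks = [], []
    for ids in batch_ids:
        ids = ids[:max_len]                       # truncate long sequences
        n_pad = max_len - len(ids)
        token_ids.append(ids + [pad_id] * n_pad)  # right-pad short sequences
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"token_id": token_ids, "mask": masks}


out = pad_to_fixed_length([[101, 7592, 102], [101, 1, 2, 3, 4, 102]], max_len=4)
print(out["token_id"])  # [[101, 7592, 102, 0], [101, 1, 2, 3]]
print(out["mask"])      # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

A real tokenizer would normally keep the closing special token when truncating; this sketch simply cuts at max_len to keep the example short.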
- faknow.data.process.text_process.chinese_tokenize(text: str, stop_words: List[str] | None = None) -> List[str]
Tokenize Chinese text with jieba, using a regex to remove punctuation.
- Parameters:
text (str) – text to be tokenized
stop_words (List[str]) – stop words. Default=None.
- Returns:
tokenized text
- Return type:
List[str]
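The pipeline described for chinese_tokenize (strip punctuation, segment, filter stop words) can be sketched without any third-party dependency. The function tokenize_sketch, its regex, and the pluggable segment argument are assumptions; by default it falls back to character-level splitting, and jieba.lcut can be passed in as the segmenter when jieba is installed.

```python
import re
from typing import Callable, List, Optional


def tokenize_sketch(text: str, stop_words: Optional[List[str]] = None,
                    segment: Callable[[str], List[str]] = list) -> List[str]:
    """Remove punctuation, segment each chunk, then drop stop words."""
    # Anything that is neither a word character nor whitespace becomes a space;
    # CJK ideographs count as word characters, full-width punctuation does not.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    stop = set(stop_words or [])
    tokens: List[str] = []
    for chunk in cleaned.split():
        tokens.extend(segment(chunk))  # default `list` splits into characters
    return [t for t in tokens if t.strip() and t not in stop]


print(tokenize_sketch("你好，世界！", stop_words=["好"]))  # ['你', '世', '界']
```

Passing segment=jieba.lcut would yield word-level tokens instead of single characters.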