faknow.data.dataset
faknow.data.dataset.bigcn_dataset
- class faknow.data.dataset.bigcn_dataset.BiGCNDataset(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, td_drop_rate=0.2, bu_drop_rate=0.2)[source]
Bases:
Dataset
Dataset for BiGCN.
- __init__(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, td_drop_rate=0.2, bu_drop_rate=0.2)[source]
- Parameters:
nodes_index (List) – list of node indices.
tree_dict (Dict) – dictionary of the propagation graph.
data_path (str) – path to the data directory, where each sample is a graph whose node features, edge indices, label and root are saved in an npz file.
lower (int) – the minimum graph size. default=2.
upper (int) – the maximum graph size. default=100000.
td_drop_rate (float) – the edge drop rate of the top-down (TD) graph. default=0.2.
bu_drop_rate (float) – the edge drop rate of the bottom-up (BU) graph. default=0.2.
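A minimal usage sketch. The indices, tree contents and data path below are placeholders; the exact schema of tree_dict and the per-sample npz files follows your prepared data:

```python
from faknow.data.dataset.bigcn_dataset import BiGCNDataset

# Placeholder inputs: nodes_index lists the sample ids, tree_dict maps each
# sample id to its propagation structure, and data_path points to a directory
# of per-sample npz files (node features, edge indices, label, root).
nodes_index = [0, 1, 2]
tree_dict = {0: {}, 1: {}, 2: {}}

dataset = BiGCNDataset(
    nodes_index=nodes_index,
    tree_dict=tree_dict,
    data_path='data/bigcn/',
    td_drop_rate=0.2,  # randomly drop edges in the top-down graph
    bu_drop_rate=0.2,  # randomly drop edges in the bottom-up graph
)
sample = dataset[0]
```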
faknow.data.dataset.cafe_dataset
faknow.data.dataset.dataset
faknow.data.dataset.dudef_dataset
faknow.data.dataset.fang_dataset
- class faknow.data.dataset.fang_dataset.FangDataset(root_path: str)[source]
Bases:
Dataset
Construct the global graph from tsv, txt and csv files.
- load_and_update_adj_lists(edge_file: str)[source]
Construct adjacency lists from the edge information file.
- preprocess_news_classification_data(news_nodes: set, n_stances: int, adj_lists: dict, stance_lists: dict, news_labels: dict)[source]
- Parameters:
news_nodes (set) – the set of news node indices to be processed.
n_stances (int) – the number of stances.
adj_lists (dict) – adjacency information dict for every node.
stance_lists (dict) – stance information dict for every node.
news_labels (dict) – news label dict.
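A hedged construction sketch; 'data/fang/' is a placeholder for a directory containing the tsv, txt and csv files described above:

```python
from faknow.data.dataset.fang_dataset import FangDataset

# Build the global graph from the files under the (placeholder) root path;
# adjacency and stance lists are assembled internally.
fang_data = FangDataset(root_path='data/fang/')
```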
- class faknow.data.dataset.fang_dataset.FangEvaluateDataSet(data_batch: list, label: list)[source]
Bases:
Dataset
Dataset used for validation and testing.
- faknow.data.dataset.fang_dataset.encode_class_idx_label(label_map: dict)[source]
Encode labels into one-hot embeddings.
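For intuition, one-hot encoding of a label map looks like the following sketch; the exact return format of encode_class_idx_label may differ:

```python
import torch

label_map = {'real': 0, 'fake': 1}  # class name -> class index
eye = torch.eye(len(label_map))
one_hot = {name: eye[idx] for name, idx in label_map.items()}
# one_hot['fake'] -> tensor([0., 1.])
```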
- faknow.data.dataset.fang_dataset.is_tag(entity_type: str, entity: str)[source]
Check whether the entity is tagged.
faknow.data.dataset.m3fend_dataset
- class faknow.data.dataset.m3fend_dataset.M3FENDDataSet(path, max_len, category_dict, dataset)[source]
Bases:
Dataset
- faknow.data.dataset.m3fend_dataset.df_filter(df_data, category_dict)[source]
Filter the input DataFrame according to the given category_dict and return the filtered DataFrame, which contains only the data points whose categories are defined in category_dict.
- Parameters:
df_data (pd.DataFrame) – Input DataFrame.
category_dict (Dict[str, int]) – Dictionary mapping category names to integers.
- Returns:
Filtered DataFrame containing only the data points with categories defined in category_dict.
- Return type:
pd.DataFrame
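A minimal sketch of the filtering behaviour, assuming the DataFrame carries its category in a 'category' column (the actual column name in FaKnow's data may differ):

```python
import pandas as pd

df_data = pd.DataFrame({
    'content': ['post a', 'post b', 'post c'],
    'category': ['politics', 'health', 'sports'],
})
category_dict = {'politics': 0, 'health': 1}

# Keep only rows whose category appears in category_dict.
filtered = df_data[df_data['category'].isin(category_dict.keys())].reset_index(drop=True)
```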
- faknow.data.dataset.m3fend_dataset.word2input(texts, max_len, dataset)[source]
Tokenize input texts with a BERT or RoBERTa tokenizer, depending on the dataset, and return the tokenized input IDs and attention masks.
- Parameters:
texts (List[str]) – List of input texts.
max_len (int) – Maximum sequence length.
dataset (str) – Dataset identifier (‘ch’ for Chinese, ‘en’ for English).
- Returns:
Tokenized input IDs and masks.
- Return type:
Tuple[torch.Tensor, torch.Tensor]
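A hedged re-implementation sketch of this behaviour with Hugging Face transformers; the model names are illustrative and not necessarily the checkpoints FaKnow uses:

```python
from transformers import AutoTokenizer

def tokenize(texts, max_len, dataset):
    # 'ch' -> a Chinese BERT, otherwise an English RoBERTa (illustrative choices).
    name = 'bert-base-chinese' if dataset == 'ch' else 'roberta-base'
    tokenizer = AutoTokenizer.from_pretrained(name)
    encoded = tokenizer(texts, max_length=max_len, padding='max_length',
                        truncation=True, return_tensors='pt')
    return encoded['input_ids'], encoded['attention_mask']

token_ids, masks = tokenize(['a short post'], max_len=170, dataset='en')
```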
faknow.data.dataset.multi_modal
- class faknow.data.dataset.multi_modal.MultiModalDataset(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], image_features: List[str], transform: Callable[[str], Any])[source]
Bases:
TextDataset
Dataset for a json file with post texts and images, allowing users to tokenize the texts and convert them into tensors; inherits from TextDataset.
- root
absolute path to json file
- Type:
str
- data
data in json file
- Type:
dict
- feature_names
names of all features in json file
- Type:
List[str]
- tokenize
function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors
- Type:
Callable[[List[str]], Any]
- text_features
a dict of text features, where each key is a feature name and each value is the feature values
- Type:
dict
- image_features
a list of image features
- Type:
List[str]
- transform
function to transform an image, which takes a path to an image and returns a tensor or a dict of tensors
- Type:
Callable[[str], Any]
- __init__(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], image_features: List[str], transform: Callable[[str], Any])[source]
- Parameters:
path (str) – absolute path to json file
text_features (List[str]) – a list of names of text features in json file
tokenize (Callable[[List[str]], Any]) – function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors
image_features (List[str]) – a list of names of image features in json file
transform (Callable[[str], Any]) – function to transform an image, which takes a path to an image and returns a tensor or a dict of tensors
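A usage sketch with hypothetical file paths, feature names and model checkpoints; the tokenize and transform callables follow the contracts documented above:

```python
from typing import Dict, List

import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

from faknow.data.dataset.multi_modal import MultiModalDataset

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize(texts: List[str]) -> Dict[str, torch.Tensor]:
    return tokenizer(texts, max_length=128, padding='max_length',
                     truncation=True, return_tensors='pt')

image_pipeline = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def transform(path: str) -> torch.Tensor:
    return image_pipeline(Image.open(path).convert('RGB'))

dataset = MultiModalDataset(
    path='data/posts.json',   # placeholder json path
    text_features=['text'],   # placeholder feature names
    tokenize=tokenize,
    image_features=['image'],
    transform=transform,
)
```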
faknow.data.dataset.safe_dataset
faknow.data.dataset.spotfake_dataset
faknow.data.dataset.text
- class faknow.data.dataset.text.TextDataset(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], to_tensor=True)[source]
Bases:
Dataset
Dataset for a json file with post texts, allowing users to tokenize the texts and convert them into tensors.
- root
absolute path to json file
- Type:
str
- data
data in json file
- Type:
dict
- feature_names
names of all features in json file
- Type:
List[str]
- tokenize
function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors
- Type:
Callable[[List[str]], Any]
- text_features
a dict of text features, where each key is a feature name and each value is the feature values
- Type:
dict
- __init__(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], to_tensor=True)[source]
- Parameters:
path (str) – absolute path to json file
text_features (List[str]) – a list of names of text features in json file
tokenize (Callable[[List[str]], Any]) – function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors
to_tensor (bool, optional) – whether to convert all features into tensors. default=True.
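A minimal construction sketch with a placeholder json path, feature name and model checkpoint; the tokenize callable follows the contract above:

```python
from transformers import BertTokenizer

from faknow.data.dataset.text import TextDataset

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize(texts):
    return tokenizer(texts, max_length=128, padding='max_length',
                     truncation=True, return_tensors='pt')

dataset = TextDataset(path='data/posts.json', text_features=['text'],
                      tokenize=tokenize)
```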
- check_feature(name: str)[source]
- Parameters:
name (str) – name of feature to check
- Raises:
ValueError – if there is no feature named ‘name’
faknow.data.dataset.trustrd_dataset
- class faknow.data.dataset.trustrd_dataset.TrustRDDataset(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, drop_rate=0)[source]
Bases:
Dataset
Dataset for TrustRD.
- __init__(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, drop_rate=0)[source]
- Parameters:
nodes_index (List) – list of node indices.
tree_dict (Dict) – dictionary of the propagation graph.
data_path (str) – path to the data directory, where each sample is a graph whose node features, edge indices, label and root are saved in an npz file.
lower (int) – the minimum graph size. default=2.
upper (int) – the maximum graph size. default=100000.
drop_rate (float) – the edge drop rate. default=0.
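A usage sketch mirroring BiGCNDataset above. Indices, tree contents and path are placeholders, and the graph-aware DataLoader assumes the samples are torch_geometric Data objects:

```python
from torch_geometric.loader import DataLoader

from faknow.data.dataset.trustrd_dataset import TrustRDDataset

dataset = TrustRDDataset(nodes_index=[0, 1, 2],
                         tree_dict={0: {}, 1: {}, 2: {}},
                         data_path='data/trustrd/',
                         drop_rate=0.2)  # randomly drop edges in each graph
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```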