faknow.data.dataset

faknow.data.dataset.bigcn_dataset

class faknow.data.dataset.bigcn_dataset.BiGCNDataset(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, td_drop_rate=0.2, bu_drop_rate=0.2)[source]

Bases: Dataset

Dataset for BiGCN.

__init__(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, td_drop_rate=0.2, bu_drop_rate=0.2)[source]
Parameters:
  • nodes_index (List) – node index list.

  • tree_dict (Dict) – dictionary of the propagation graph.

  • data_path (str) – path to the data directory, where each sample is a graph with node features, edge indices, the label and the root saved in an npz file.

  • lower (int) – minimum graph size. Default=2.

  • upper (int) – maximum graph size. Default=100000.

  • td_drop_rate (float) – drop rate of edges in the top-down (TD) propagation graph. Default=0.2.

  • bu_drop_rate (float) – drop rate of edges in the bottom-up (BU) propagation graph. Default=0.2.
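
Example (a minimal construction sketch; the index list, tree dictionary, and data path are illustrative placeholders, not values fixed by the library):

    # Hypothetical usage: real inputs come from preprocessed propagation data.
    from faknow.data.dataset.bigcn_dataset import BiGCNDataset

    nodes_index = [0, 1, 2]   # indices of the graph samples to load
    tree_dict = {}            # propagation tree dictionary (placeholder)
    dataset = BiGCNDataset(nodes_index, tree_dict,
                           data_path='data/bigcn_graphs',  # dir of per-sample npz files
                           td_drop_rate=0.2, bu_drop_rate=0.2)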

faknow.data.dataset.cafe_dataset

class faknow.data.dataset.cafe_dataset.CafeDataset(t_file: str, i_file: str)[source]

Bases: Dataset

__init__(t_file: str, i_file: str)[source]
Parameters:
  • t_file (str) – text file, containing texts and labels.

  • i_file (str) – image file.
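
Example (a hedged sketch; both file paths are placeholders and their exact formats are not specified above):

    from faknow.data.dataset.cafe_dataset import CafeDataset

    # hypothetical paths to preprocessed text (with labels) and image files
    dataset = CafeDataset(t_file='data/cafe/train_text.npy',
                          i_file='data/cafe/train_image.npy')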

faknow.data.dataset.dataset

class faknow.data.dataset.dataset.Dataset[source]

Bases: Dataset

__init__() → None[source]

faknow.data.dataset.dudef_dataset

faknow.data.dataset.fang_dataset

class faknow.data.dataset.fang_dataset.FangDataset(root_path: str)[source]

Bases: Dataset

Construct the global graph from tsv, txt, and csv files.

__init__(root_path: str)[source]
get_news_label_map(news_info_path: str)[source]
get_train_val_test_labels_nodes(entities: list, news_label_map: dict)[source]
load()[source]

Load all docs from the root path.

load_and_update_adj_lists(edge_file: str)[source]

Construct adjacency lists from the edge information doc.

load_stance_map(stance_file: str)[source]

Get stance features from the stance information doc.

preprocess_news_classification_data(news_nodes: set, n_stances: int, adj_lists: dict, stance_lists: dict, news_labels: dict)[source]
Parameters:
  • news_nodes (set) – indices of the news nodes to be processed.

  • n_stances (int) – number of stances.

  • adj_lists (dict) – adjacency information dict for every node.

  • stance_lists (dict) – stance information dict for every node.

  • news_labels (dict) – news label dict.
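
Example (a hedged usage sketch; 'data/fang' stands in for a root directory holding the tsv, txt, and csv docs, and load() is called explicitly per the documented method):

    from faknow.data.dataset.fang_dataset import FangDataset

    fang_data = FangDataset('data/fang')  # root path of the graph source docs
    fang_data.load()                      # load all docs from the root path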

class faknow.data.dataset.fang_dataset.FangEvaluateDataSet(data_batch: list, label: list)[source]

Bases: Dataset

Dataset used for validation and testing.

__init__(data_batch: list, label: list)[source]
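
Example (a hedged sketch; the feature tensors and labels are illustrative, and the exact element types expected by the class may differ):

    import torch
    from torch.utils.data import DataLoader
    from faknow.data.dataset.fang_dataset import FangEvaluateDataSet

    data_batch = [torch.randn(16) for _ in range(4)]  # hypothetical node features
    labels = [0, 1, 0, 1]                             # hypothetical class indices
    loader = DataLoader(FangEvaluateDataSet(data_batch, labels), batch_size=2)
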
faknow.data.dataset.fang_dataset.encode_class_idx_label(label_map: dict)[source]

Encode labels into one-hot embeddings.
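
A standalone sketch of the one-hot encoding this function describes; the exact return format of encode_class_idx_label may differ:

    import numpy as np

    label_map = {'news_0': 0, 'news_1': 2, 'news_2': 1}  # hypothetical label map
    n_classes = max(label_map.values()) + 1
    one_hot = {k: np.eye(n_classes)[v] for k, v in label_map.items()}
    # one_hot['news_1'] -> array([0., 0., 1.])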

faknow.data.dataset.fang_dataset.is_tag(entity_type: str, entity: str)[source]

Check whether the entity is tagged.

faknow.data.dataset.fang_dataset.load_json(input_path: str)[source]
faknow.data.dataset.fang_dataset.load_text_as_list(input_path: str)[source]
faknow.data.dataset.fang_dataset.read_csv(path: str, load_header: bool = False, delimiter: str = ',')[source]
faknow.data.dataset.fang_dataset.row_normalize(mtx: array)[source]

Row-normalize a sparse matrix.
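
A standalone dense sketch of the documented behaviour (each row divided by its row sum); the library version operates on a sparse matrix:

    import numpy as np

    def row_normalize_sketch(mtx: np.ndarray) -> np.ndarray:
        row_sum = mtx.sum(axis=1, keepdims=True)
        row_sum[row_sum == 0] = 1.0   # leave all-zero rows unchanged
        return mtx / row_sum

    row_normalize_sketch(np.array([[1.0, 3.0], [0.0, 0.0]]))
    # -> [[0.25, 0.75], [0., 0.]]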

faknow.data.dataset.m3fend_dataset

class faknow.data.dataset.m3fend_dataset.M3FENDDataSet(path, max_len, category_dict, dataset)[source]

Bases: Dataset

__init__(path, max_len, category_dict, dataset)[source]
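
Example (an illustrative construction; the pickle path and category mapping are placeholders, and the 'en'/'ch' dataset flag follows the word2input documentation below):

    from faknow.data.dataset.m3fend_dataset import M3FENDDataSet

    category_dict = {'politics': 0, 'health': 1}  # hypothetical category mapping
    dataset = M3FENDDataSet(path='data/m3fend/train.pkl', max_len=170,
                            category_dict=category_dict, dataset='en')
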
faknow.data.dataset.m3fend_dataset.df_filter(df_data, category_dict)[source]

Filter the input DataFrame according to the given category_dict, returning a filtered DataFrame that contains only the data points whose categories are defined in category_dict.

Parameters:
  • df_data (pd.DataFrame) – Input DataFrame.

  • category_dict (Dict[str, int]) – Dictionary mapping category names to integers.

Returns:

Filtered DataFrame containing only the data points with categories defined in category_dict.

Return type:

pd.DataFrame
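
A plausible one-line equivalent; that the category column is named 'category' is an assumption, not stated above:

    import pandas as pd

    def df_filter_sketch(df_data: pd.DataFrame, category_dict: dict) -> pd.DataFrame:
        # keep only rows whose category appears in category_dict
        return df_data[df_data['category'].isin(set(category_dict))]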

faknow.data.dataset.m3fend_dataset.read_pkl(path)[source]
faknow.data.dataset.m3fend_dataset.word2input(texts, max_len, dataset)[source]

Tokenize input texts using BERT or RoBERTa tokenizer based on the dataset. Return tokenized input IDs and masks.

Parameters:
  • texts (List[str]) – List of input texts.

  • max_len (int) – Maximum sequence length.

  • dataset (str) – Dataset identifier (‘ch’ for Chinese, ‘en’ for English).

Returns:

Tokenized input IDs and masks.

Return type:

Tuple[torch.Tensor, torch.Tensor]
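
A hedged re-implementation with HuggingFace transformers; the concrete checkpoints ('bert-base-chinese', 'roberta-base') are assumptions, not confirmed by the docstring:

    from transformers import BertTokenizer, RobertaTokenizer

    def word2input_sketch(texts, max_len, dataset):
        # pick a tokenizer matching the dataset language
        if dataset == 'ch':
            tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
        else:
            tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
        enc = tokenizer(texts, max_length=max_len, padding='max_length',
                        truncation=True, return_tensors='pt')
        return enc['input_ids'], enc['attention_mask']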

faknow.data.dataset.multi_modal

class faknow.data.dataset.multi_modal.MultiModalDataset(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], image_features: List[str], transform: Callable[[str], Any])[source]

Bases: TextDataset

Dataset for a json file with post texts and images, allowing users to tokenize the texts and transform the images into tensors; inherits from TextDataset.

root

absolute path to json file

Type:

str

data

data in json file

Type:

dict

feature_names

names of all features in json file

Type:

List[str]

tokenize

function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

Type:

Callable[[List[str]], Any]

text_features

a dict of text features, where each key is a feature name and each value is the corresponding feature values

Type:

dict

image_features

a list of image features

Type:

List[str]

transform

function to transform image, which takes a path to image and returns a tensor or a dict of tensors

Type:

Callable[[str], Any]

__init__(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], image_features: List[str], transform: Callable[[str], Any])[source]
Parameters:
  • path (str) – absolute path to json file

  • text_features (List[str]) – a list of names of text features in json file

  • tokenize (Callable[[List[str]], Any]) – function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

  • image_features (List[str]) – a list of names of image features in json file

  • transform (Callable[[str], Any]) – function to transform image, which takes a path to image and returns a tensor or a dict of tensors

process_image(name: str)[source]

Mark a feature as an image feature.

Parameters:

name (str) – name of the feature to mark as an image feature

Raises:
  • ValueError – if there is no feature named ‘name’

  • ValueError – if the feature has already been marked as an image feature

remove_image(name: str)[source]

Remove a feature from image features.

Parameters:

name (str) – name of the feature to remove from image features

Raises:
  • ValueError – if there is no feature named ‘name’

  • ValueError – if the feature has not been marked as an image feature
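
Example (an end-to-end construction sketch; 'posts.json', the feature names, and the BERT checkpoint are placeholders, while the tokenize and transform signatures follow the parameter docs above):

    from PIL import Image
    from torchvision import transforms
    from transformers import BertTokenizer
    from faknow.data.dataset.multi_modal import MultiModalDataset

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def tokenize(texts):
        # list of texts -> tensor of token ids
        return tokenizer(texts, max_length=128, padding='max_length',
                         truncation=True, return_tensors='pt')['input_ids']

    def transform(path):
        # image path -> image tensor
        return transforms.ToTensor()(Image.open(path).convert('RGB'))

    dataset = MultiModalDataset('posts.json', ['text'], tokenize,
                                ['image'], transform)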

faknow.data.dataset.safe_dataset

class faknow.data.dataset.safe_dataset.SAFENumpyDataset(root_dir: str)[source]

Bases: Dataset

__init__(root_dir: str)[source]

faknow.data.dataset.spotfake_dataset

faknow.data.dataset.text

class faknow.data.dataset.text.TextDataset(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], to_tensor=True)[source]

Bases: Dataset

Dataset for a json file with post texts, allowing users to tokenize the texts and convert them into tensors.

root

absolute path to json file

Type:

str

data

data in json file

Type:

dict

feature_names

names of all features in json file

Type:

List[str]

tokenize

function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

Type:

Callable[[List[str]], Any]

text_features

a dict of text features, where each key is a feature name and each value is the corresponding feature values

Type:

dict

__init__(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], to_tensor=True)[source]
Parameters:
  • path (str) – absolute path to json file

  • text_features (List[str]) – a list of names of text features in json file

  • tokenize (Callable[[List[str]], Any]) – function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

  • to_tensor (bool, optional) – whether to convert all features into tensors. Default=True.

check_feature(name: str)[source]
Parameters:

name (str) – name of feature to check

Raises:

ValueError – if there is no feature named ‘name’

process_text(name: str)[source]

Process a text feature with the tokenize function, storing the old value of the feature in text_features and the new value in data.

Parameters:

name (str) – name of text feature to process

remove_text(name: str)[source]

Remove a text feature from self.text_features.

Parameters:

name (str) – name of text feature to remove

Raises:
  • ValueError – if there is no feature named ‘name’

  • ValueError – if ‘name’ has not been marked as a text feature
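
Example (a construction sketch; 'posts.json', the feature name, and the BERT checkpoint are placeholders following the parameter docs above):

    from transformers import BertTokenizer
    from faknow.data.dataset.text import TextDataset

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def tokenize(texts):
        # list of texts -> tensor of token ids
        return tokenizer(texts, max_length=128, padding='max_length',
                         truncation=True, return_tensors='pt')['input_ids']

    dataset = TextDataset('posts.json', ['text'], tokenize)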

faknow.data.dataset.trustrd_dataset

class faknow.data.dataset.trustrd_dataset.TrustRDDataset(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, drop_rate=0)[source]

Bases: Dataset

Dataset for TrustRD.

__init__(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, drop_rate=0)[source]
Parameters:
  • nodes_index (List) – node index list.

  • tree_dict (Dict) – dictionary of the propagation graph.

  • data_path (str) – path to the data directory, where each sample is a graph with node features, edge indices, the label and the root saved in an npz file.

  • lower (int) – minimum graph size. Default=2.

  • upper (int) – maximum graph size. Default=100000.

  • drop_rate (float) – drop rate of edges. Default=0.
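
Example (a minimal construction sketch mirroring BiGCNDataset above; all inputs are placeholders):

    from faknow.data.dataset.trustrd_dataset import TrustRDDataset

    dataset = TrustRDDataset(nodes_index=[0, 1, 2], tree_dict={},
                             data_path='data/trustrd_graphs', drop_rate=0.0)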