faknow.data.dataset

faknow.data.dataset.bigcn_dataset

class faknow.data.dataset.bigcn_dataset.BiGCNDataset(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, td_drop_rate=0.2, bu_drop_rate=0.2)[source]

Bases: Dataset

Dataset for BiGCN.

__init__(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, td_drop_rate=0.2, bu_drop_rate=0.2)[source]
Parameters:
  • nodes_index (List) – node index list.

  • tree_dict (Dict) – dictionary of the propagation graph.

  • data_path (str) – path to the data directory, where each sample is a graph with node features, edge indices, the label and the root saved in an npz file.

  • lower (int) – minimum graph size. Default=2.

  • upper (int) – maximum graph size. Default=100000.

  • td_drop_rate (float) – drop rate of edges in the top-down (TD) propagation graph. Default=0.2.

  • bu_drop_rate (float) – drop rate of edges in the bottom-up (BU) propagation graph. Default=0.2.
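
Example (a minimal construction sketch; the index list, tree dictionary, and data path are illustrative placeholders, not values fixed by the library):

    # Hypothetical usage: real inputs come from preprocessed propagation data.
    from faknow.data.dataset.bigcn_dataset import BiGCNDataset

    nodes_index = [0, 1, 2]   # indices of the graph samples to load
    tree_dict = {}            # propagation tree dictionary (placeholder)
    dataset = BiGCNDataset(nodes_index, tree_dict,
                           data_path='data/bigcn_graphs',  # dir of per-sample npz files
                           td_drop_rate=0.2, bu_drop_rate=0.2)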

faknow.data.dataset.cafe_dataset

class faknow.data.dataset.cafe_dataset.CafeDataset(t_file: str, i_file: str)[source]

Bases: Dataset

__init__(t_file: str, i_file: str)[source]
Parameters:
  • t_file (str) – text file, containing texts and labels.

  • i_file (str) – image file.
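
Example (a hedged sketch; both file paths are placeholders and their exact formats are not specified above):

    from faknow.data.dataset.cafe_dataset import CafeDataset

    # hypothetical paths to preprocessed text (with labels) and image files
    dataset = CafeDataset(t_file='data/cafe/train_text.npy',
                          i_file='data/cafe/train_image.npy')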

faknow.data.dataset.dataset

class faknow.data.dataset.dataset.Dataset[source]

Bases: Dataset

__init__() → None[source]

faknow.data.dataset.dudef_dataset

faknow.data.dataset.fang_dataset

class faknow.data.dataset.fang_dataset.FangDataset(root_path: str)[source]

Bases: Dataset

Construct the global graph from tsv, txt, and csv files.

__init__(root_path: str)[source]
get_news_label_map(news_info_path: str)[source]
get_train_val_test_labels_nodes(entities: list, news_label_map: dict)[source]
load()[source]

Load all docs from the root path.

load_and_update_adj_lists(edge_file: str)[source]

Construct adjacency lists from the edge information doc.

load_stance_map(stance_file: str)[source]

Get stance features from the stance information doc.

preprocess_news_classification_data(news_nodes: set, n_stances: int, adj_lists: dict, stance_lists: dict, news_labels: dict)[source]
Parameters:
  • news_nodes (set) – indices of the news nodes to be processed.

  • n_stances (int) – number of stances.

  • adj_lists (dict) – adjacency information dict for every node.

  • stance_lists (dict) – stance information dict for every node.

  • news_labels (dict) – news label dict.
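
Example (a hedged usage sketch; 'data/fang' stands in for a root directory holding the tsv, txt, and csv docs, and load() is called explicitly per the documented method):

    from faknow.data.dataset.fang_dataset import FangDataset

    fang_data = FangDataset('data/fang')  # root path of the graph source docs
    fang_data.load()                      # load all docs from the root path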

class faknow.data.dataset.fang_dataset.FangEvaluateDataSet(data_batch: list, label: list)[source]

Bases: Dataset

Dataset used for validation and testing.

__init__(data_batch: list, label: list)[source]
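
Example (a hedged sketch; the feature tensors and labels are illustrative, and the exact element types expected by the class may differ):

    import torch
    from torch.utils.data import DataLoader
    from faknow.data.dataset.fang_dataset import FangEvaluateDataSet

    data_batch = [torch.randn(16) for _ in range(4)]  # hypothetical node features
    labels = [0, 1, 0, 1]                             # hypothetical class indices
    loader = DataLoader(FangEvaluateDataSet(data_batch, labels), batch_size=2)
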
faknow.data.dataset.fang_dataset.encode_class_idx_label(label_map: dict)[source]

Encode labels into one-hot embeddings.
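
A standalone sketch of the one-hot encoding this function describes; the exact return format of encode_class_idx_label may differ:

    import numpy as np

    label_map = {'news_0': 0, 'news_1': 2, 'news_2': 1}  # hypothetical label map
    n_classes = max(label_map.values()) + 1
    one_hot = {k: np.eye(n_classes)[v] for k, v in label_map.items()}
    # one_hot['news_1'] -> array([0., 0., 1.])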

faknow.data.dataset.fang_dataset.is_tag(entity_type: str, entity: str)[source]

Check whether the entity is tagged.

faknow.data.dataset.fang_dataset.load_json(input_path: str)[source]
faknow.data.dataset.fang_dataset.load_text_as_list(input_path: str)[source]
faknow.data.dataset.fang_dataset.read_csv(path: str, load_header: bool = False, delimiter: str = ',')[source]
faknow.data.dataset.fang_dataset.row_normalize(mtx: array)[source]

Row-normalize a sparse matrix.
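
A standalone dense sketch of the documented behaviour (each row divided by its row sum); the library version operates on a sparse matrix:

    import numpy as np

    def row_normalize_sketch(mtx: np.ndarray) -> np.ndarray:
        row_sum = mtx.sum(axis=1, keepdims=True)
        row_sum[row_sum == 0] = 1.0   # leave all-zero rows unchanged
        return mtx / row_sum

    row_normalize_sketch(np.array([[1.0, 3.0], [0.0, 0.0]]))
    # -> [[0.25, 0.75], [0., 0.]]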

faknow.data.dataset.m3fend_dataset

class faknow.data.dataset.m3fend_dataset.M3FENDDataSet(path, max_len, category_dict, dataset)[source]

Bases: Dataset

__init__(path, max_len, category_dict, dataset)[source]
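
Example (an illustrative construction; the pickle path and category mapping are placeholders, and the 'en'/'ch' dataset flag follows the word2input documentation below):

    from faknow.data.dataset.m3fend_dataset import M3FENDDataSet

    category_dict = {'politics': 0, 'health': 1}  # hypothetical category mapping
    dataset = M3FENDDataSet(path='data/m3fend/train.pkl', max_len=170,
                            category_dict=category_dict, dataset='en')
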
faknow.data.dataset.m3fend_dataset.df_filter(df_data, category_dict)[source]

Filter the input DataFrame according to the given category_dict, returning a filtered DataFrame that contains only the data points whose categories are defined in category_dict.

Parameters:
  • df_data (pd.DataFrame) – Input DataFrame.

  • category_dict (Dict[str, int]) – Dictionary mapping category names to integers.

Returns:

Filtered DataFrame containing only the data points with categories defined in category_dict.

Return type:

pd.DataFrame
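
A plausible one-line equivalent; that the category column is named 'category' is an assumption, not stated above:

    import pandas as pd

    def df_filter_sketch(df_data: pd.DataFrame, category_dict: dict) -> pd.DataFrame:
        # keep only rows whose category appears in category_dict
        return df_data[df_data['category'].isin(set(category_dict))]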

faknow.data.dataset.m3fend_dataset.read_pkl(path)[source]
faknow.data.dataset.m3fend_dataset.word2input(texts, max_len, dataset)[source]

Tokenize input texts using BERT or RoBERTa tokenizer based on the dataset. Return tokenized input IDs and masks.

Parameters:
  • texts (List[str]) – List of input texts.

  • max_len (int) – Maximum sequence length.

  • dataset (str) – Dataset identifier (‘ch’ for Chinese, ‘en’ for English).

Returns:

Tokenized input IDs and masks.

Return type:

Tuple[torch.Tensor, torch.Tensor]
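
A hedged re-implementation with HuggingFace transformers; the concrete checkpoints ('bert-base-chinese', 'roberta-base') are assumptions, not confirmed by the docstring:

    from transformers import BertTokenizer, RobertaTokenizer

    def word2input_sketch(texts, max_len, dataset):
        # pick a tokenizer matching the dataset language
        if dataset == 'ch':
            tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
        else:
            tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
        enc = tokenizer(texts, max_length=max_len, padding='max_length',
                        truncation=True, return_tensors='pt')
        return enc['input_ids'], enc['attention_mask']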

faknow.data.dataset.multi_modal

class faknow.data.dataset.multi_modal.MultiModalDataset(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], image_features: List[str], transform: Callable[[str], Any])[source]

Bases: TextDataset

Dataset for a json file with post texts and images, allowing users to tokenize the texts and transform the images into tensors; inherits from TextDataset.

root

absolute path to json file

Type:

str

data

data in json file

Type:

dict

feature_names

names of all features in json file

Type:

List[str]

tokenize

function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

Type:

Callable[[List[str]], Any]

text_features

a dict of text features, where each key is a feature name and each value is the corresponding feature values

Type:

dict

image_features

a list of image features

Type:

List[str]

transform

function to transform image, which takes a path to image and returns a tensor or a dict of tensors

Type:

Callable[[str], Any]

__init__(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], image_features: List[str], transform: Callable[[str], Any])[source]
Parameters:
  • path (str) – absolute path to json file

  • text_features (List[str]) – a list of names of text features in json file

  • tokenize (Callable[[List[str]], Any]) – function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

  • image_features (List[str]) – a list of names of image features in json file

  • transform (Callable[[str], Any]) – function to transform image, which takes a path to image and returns a tensor or a dict of tensors

process_image(name: str)[source]

Mark a feature as an image feature.

Parameters:

name (str) – name of the feature to mark as an image feature

Raises:
  • ValueError – if there is no feature named ‘name’

  • ValueError – if the feature has already been marked as an image feature

remove_image(name: str)[source]

Remove a feature from image features.

Parameters:

name (str) – name of the feature to remove from image features

Raises:
  • ValueError – if there is no feature named ‘name’

  • ValueError – if the feature has not been marked as an image feature
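
Example (an end-to-end construction sketch; 'posts.json', the feature names, and the BERT checkpoint are placeholders, while the tokenize and transform signatures follow the parameter docs above):

    from PIL import Image
    from torchvision import transforms
    from transformers import BertTokenizer
    from faknow.data.dataset.multi_modal import MultiModalDataset

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def tokenize(texts):
        # list of texts -> tensor of token ids
        return tokenizer(texts, max_length=128, padding='max_length',
                         truncation=True, return_tensors='pt')['input_ids']

    def transform(path):
        # image path -> image tensor
        return transforms.ToTensor()(Image.open(path).convert('RGB'))

    dataset = MultiModalDataset('posts.json', ['text'], tokenize,
                                ['image'], transform)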

faknow.data.dataset.safe_dataset

class faknow.data.dataset.safe_dataset.SAFENumpyDataset(root_dir: str)[source]

Bases: Dataset

__init__(root_dir: str)[source]

faknow.data.dataset.spotfake_dataset

faknow.data.dataset.text

class faknow.data.dataset.text.TextDataset(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], to_tensor=True)[source]

Bases: Dataset

Dataset for a json file with post texts, allowing users to tokenize the texts and convert them into tensors.

root

absolute path to json file

Type:

str

data

data in json file

Type:

dict

feature_names

names of all features in json file

Type:

List[str]

tokenize

function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

Type:

Callable[[List[str]], Any]

text_features

a dict of text features, where each key is a feature name and each value is the corresponding feature values

Type:

dict

__init__(path: str, text_features: List[str], tokenize: Callable[[List[str]], Any], to_tensor=True)[source]
Parameters:
  • path (str) – absolute path to json file

  • text_features (List[str]) – a list of names of text features in json file

  • tokenize (Callable[[List[str]], Any]) – function to tokenize text, which takes a list of texts and returns a tensor or a dict of tensors

  • to_tensor (bool, optional) – whether to convert all features into tensors. Default=True.

check_feature(name: str)[source]
Parameters:

name (str) – name of feature to check

Raises:

ValueError – if there is no feature named ‘name’

process_text(name: str)[source]

Process a text feature with the tokenize function, storing the old value of the feature in text_features and the new value in data.

Parameters:

name (str) – name of text feature to process

remove_text(name: str)[source]

Remove a text feature from self.text_features.

Parameters:

name (str) – name of text feature to remove

Raises:
  • ValueError – if there is no feature named ‘name’

  • ValueError – if ‘name’ has not been marked as a text feature
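
Example (a construction sketch; 'posts.json', the feature name, and the BERT checkpoint are placeholders following the parameter docs above):

    from transformers import BertTokenizer
    from faknow.data.dataset.text import TextDataset

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def tokenize(texts):
        # list of texts -> tensor of token ids
        return tokenizer(texts, max_length=128, padding='max_length',
                         truncation=True, return_tensors='pt')['input_ids']

    dataset = TextDataset('posts.json', ['text'], tokenize)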

faknow.data.dataset.trustrd_dataset

class faknow.data.dataset.trustrd_dataset.TrustRDDataset(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, drop_rate=0)[source]

Bases: Dataset

Dataset for TrustRD.

__init__(nodes_index: List, tree_dict: Dict, data_path: str, lower=2, upper=100000, drop_rate=0)[source]
Parameters:
  • nodes_index (List) – node index list.

  • tree_dict (Dict) – dictionary of the propagation graph.

  • data_path (str) – path to the data directory, where each sample is a graph with node features, edge indices, the label and the root saved in an npz file.

  • lower (int) – minimum graph size. Default=2.

  • upper (int) – maximum graph size. Default=100000.

  • drop_rate (float) – drop rate of edges. Default=0.
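
Example (a minimal construction sketch mirroring BiGCNDataset above; all inputs are placeholders):

    from faknow.data.dataset.trustrd_dataset import TrustRDDataset

    dataset = TrustRDDataset(nodes_index=[0, 1, 2], tree_dict={},
                             data_path='data/trustrd_graphs', drop_rate=0.0)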