From fd9d0842ea38c2faebfc7cddad8220727ebe54ce Mon Sep 17 00:00:00 2001 From: Nick Lin <33555716+nicklin96@users.noreply.github.com> Date: Mon, 24 Jun 2019 11:20:22 +0800 Subject: [PATCH] Update How_to_generate_fragments.md --- genrate_fragments/How_to_generate_fragments.md | 44 ++++++++++++++++++++++++-- 1 file changed, 42 insertions(+), 2 deletions(-) diff --git a/genrate_fragments/How_to_generate_fragments.md b/genrate_fragments/How_to_generate_fragments.md index bf98187..f3453b2 100644 --- a/genrate_fragments/How_to_generate_fragments.md +++ b/genrate_fragments/How_to_generate_fragments.md @@ -2,11 +2,15 @@ There are three kinds of fragments in gAnswer: entity fragments, predicate fragments and type fragments. They are information extracted from the triples helping gAnswer improve its results. In this section we will show you how to generate your own fragments step by step with a simple example ### Step 1: Clean the triple files -Suppose we have a triple file containing only 3 triples: +Suppose we have a triple file containing only seven triples: ```java + + "Jeff" + "Tom" - + + ``` This is the exactly form of triples we need to generate fragments. However sometimes the entity and predicate contain some extra information. Take dbpedia dataset as an example. The following is the original form of a dbpedia triple ```java @@ -15,3 +19,39 @@ This is the exactly form of triples we need to generate fragments. However somet As you can see, every entity and predicate is marked with an URI, but we don't need the prefix of the URIs. See Step1_clean_triples.py. That is the code we use to clean dbpedia triples. Generally, please remember that making sure the entity and predicate names are clear enough to indicate their true meaning and contain no extra information is all you need to do in this step. By the way, if you have more than one triple files, please combine them into one so that the following steps will be easier. + +### Step 2: remove duplicate triples +One triple may occur more than once in the clean triple file, especially when you combine many triple files into one. +gAnswer is OK with receving duplicate triples but it will influence its performance. + +### Step 3: extract entity, predicate and type name for id allocation +To save space cost, the fragment files are not constructed based on entity, predicate and type names themselves but their ids. Therefore, we must extract every entity, predicate and type name out of the triple file and give them a uniue id respectively. In our example,the id files will goes like this: +```java +//Entity ids + 1 + 2 + 3 + +//predicate ids + 1 + 2 + 3 + 4 + +//type ids + 1 + 2 +``` + +### Step 4: represent triples with ids +For convenience, before we generate the fragments, we first replace all the name strings in triple file with corresponding ids. +In our example, the new triple file is like: +```java +1 1 3 +2 2 1 +1 4 -1 +2 4 -1 +1 3 1 +2 3 1 +3 3 2 +```