## How to generate fragments out of your own triples

There are three kinds of fragments in gAnswer: entity fragments, predicate fragments and type fragments. They hold information extracted from the triples that helps gAnswer improve its results. In this section we show how to generate your own fragments step by step, using a simple example.

### Step 1: Clean the triple files

Suppose we have a triple file containing only eight triples:

```java
<StudentA> <major> <computer_science>
<StudentB> <friend_of> <StudentA>
<StudentB> <deskmate_of> <StudentA>
<StudentA> <name> "Jeff"
<StudentB> <name> "Tom"
<StudentA> <type> <Person>
<StudentB> <type> <Person>
<computer_science> <type> <Subject>
```

Each line has three segments: subject, predicate and object. This is exactly the form of triples we need to generate fragments. However, sometimes the entities and predicates contain extra information. Take the DBpedia dataset as an example. The following is the original form of a DBpedia triple:

```java
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/property/demonym> <http://dbpedia.org/resource/Adjectivals_and_demonyms_for_U.S._states> .
```

As you can see, every entity and predicate is identified by a URI, but we don't need the URI prefixes. See Step1_clean_triples.py, the code we use to clean DBpedia triples.

In general, all you need to do in this step is make sure that the entity and predicate names are clear enough to indicate their meaning and contain no extra information.

By the way, if you have more than one triple file, please combine them into one so that the following steps are easier.
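
As a rough illustration of this cleaning step, the sketch below keeps only the text after the last `/` or `#` of every `<...>` term and drops the trailing ` .`. It is not the code in Step1_clean_triples.py. The names `clean_triple` and `short_name` are hypothetical, and the sketch assumes one triple per line with no `<...>` sequences inside literals.

```python
import re

# Matches one <...> term that contains no whitespace or nested angle brackets.
URI_TERM = re.compile(r"<([^<>\s]+)>")


def short_name(match):
    """Return the term with everything up to the last '/' or '#' removed."""
    uri = match.group(1).rstrip("/#")
    return "<" + re.split(r"[/#]", uri)[-1] + ">"


def clean_triple(line):
    """Strip URI prefixes and the trailing ' .' from one triple line."""
    line = line.strip()
    if line.endswith(" ."):
        line = line[:-2].rstrip()
    return URI_TERM.sub(short_name, line)


raw = (
    "<http://dbpedia.org/resource/Alabama> "
    "<http://dbpedia.org/property/demonym> "
    "<http://dbpedia.org/resource/Adjectivals_and_demonyms_for_U.S._states> ."
)
cleaned = clean_triple(raw)
print(cleaned)
```

Terms that are already short, such as `<StudentA>`, pass through unchanged, so the function can be applied to a mixed file.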
### Step 2: Remove duplicate triples

A triple may occur more than once in the cleaned triple file, especially when you combine many triple files into one. gAnswer accepts duplicate triples, but they will affect its performance.
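
One possible way to drop duplicates while keeping the original order is sketched below. The function name `remove_duplicates` is hypothetical, and the sketch assumes the whole file fits in memory and that two triples are duplicates only when their lines are identical after trimming whitespace.

```python
def remove_duplicates(lines):
    """Return the non-empty lines with repeats removed, first occurrence kept."""
    stripped = (line.strip() for line in lines)
    # dict preserves insertion order, so it doubles as an ordered set.
    return list(dict.fromkeys(line for line in stripped if line))


combined = [
    "<StudentB> <friend_of> <StudentA>",
    "<StudentA> <major> <computer_science>",
    "<StudentB> <friend_of> <StudentA>",
    "",
]
unique = remove_duplicates(combined)
print(unique)
```

For files too large for memory, sorting the file externally and removing adjacent repeats is a common alternative.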
### Step 3: Extract entity, predicate and type names for id allocation

To save space, the fragment files are built not from the entity, predicate and type names themselves but from their ids. Therefore, we must extract every entity, predicate and type name from the triple file and give each one a unique id. Entities, predicates and types are numbered separately. In our example, the id files look like this:

```java
//Entity ids
<StudentA> 1
<StudentB> 2
<computer_science> 3
//predicate ids
<major> 1
<friend_of> 2
<type> 3
<name> 4
<deskmate_of> 5
//type ids
<Person> 1
<Subject> 2
```
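
The extraction can be sketched as follows. This is only an illustration under stated assumptions: the function name `allocate_ids` is hypothetical, the object of a `<type>` triple is treated as a type, any other `<...>` object is treated as an entity, and literals get no id. Ids are handed out in order of first appearance, so the numbers need not match the hand-written tables above. Any assignment works as long as each name gets a unique id.

```python
def allocate_ids(triples, type_predicate="<type>"):
    """Build name -> id tables for entities, predicates and types."""
    entity_ids, predicate_ids, type_ids = {}, {}, {}

    def assign(table, name):
        # Give the name the next free id unless it already has one.
        table.setdefault(name, len(table) + 1)

    for line in triples:
        # maxsplit=2 keeps literals that contain spaces in one piece.
        subject, predicate, obj = line.strip().split(None, 2)
        assign(entity_ids, subject)
        assign(predicate_ids, predicate)
        if predicate == type_predicate:
            assign(type_ids, obj)
        elif obj.startswith("<"):
            assign(entity_ids, obj)
    return entity_ids, predicate_ids, type_ids


example = [
    "<StudentA> <major> <computer_science>",
    "<StudentB> <friend_of> <StudentA>",
    '<StudentA> <name> "Jeff"',
    "<StudentA> <type> <Person>",
    "<computer_science> <type> <Subject>",
]
entities, predicates, types = allocate_ids(example)
print(entities, predicates, types)
```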
### Step 4: Represent triples with ids

For convenience, before we generate the fragments, we first replace all the name strings in the triple file with the corresponding ids.

In our example, the new triple file looks like this:

```java
1 1 3
2 2 1
2 5 1
1 4 -1
2 4 -1
1 3 1
2 3 1
3 3 2
```

Notice that we use -1 to represent values that are neither entities nor types, such as numbers and string literals.
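
A minimal sketch of this replacement, using the id tables from Step 3, is shown below. The function name `encode_triple` and the single-space separator are assumptions taken from the way the example is printed. The object is looked up in the type table when the predicate is `<type>` and in the entity table otherwise. Anything not found becomes -1.

```python
ENTITY_IDS = {"<StudentA>": 1, "<StudentB>": 2, "<computer_science>": 3}
PREDICATE_IDS = {
    "<major>": 1, "<friend_of>": 2, "<type>": 3, "<name>": 4, "<deskmate_of>": 5,
}
TYPE_IDS = {"<Person>": 1, "<Subject>": 2}


def encode_triple(line, type_predicate="<type>"):
    """Rewrite one cleaned triple line as 'subject_id predicate_id object_id'."""
    subject, predicate, obj = line.strip().split(None, 2)
    if predicate == type_predicate:
        object_id = TYPE_IDS.get(obj, -1)
    else:
        # Literals and numbers are not in the entity table, so they map to -1.
        object_id = ENTITY_IDS.get(obj, -1)
    return f"{ENTITY_IDS[subject]} {PREDICATE_IDS[predicate]} {object_id}"


print(encode_triple("<StudentA> <major> <computer_science>"))
print(encode_triple('<StudentA> <name> "Jeff"'))
```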
### Step 5: Generate entity fragments

Now we are ready to generate entity fragments. Every entity has its own fragment. A fragment records the edges related to the entity as well as its neighbor entities.

First let's clarify the idea of subject and object in a triple. A triple consists of three parts: subject, predicate and object. For example:

```java
<StudentA> <major> <computer_science>
```

Here *StudentA* is the subject, *major* is the predicate and *computer_science* is the object. Basically, the first element is the subject, the second is the predicate and the third is the object. Sometimes the object is neither an entity nor a type: values such as numbers and strings can also be objects.

We define 5 kinds of edges:

1. InEntEdge: The entity is the object of the edge and the subject is also an entity.
2. OutEntEdge: The entity is the subject of the edge and the object is also an entity.
3. InEdge: The entity is the object of the edge.
4. OutEdge: The entity is the subject of the edge.
5. typeEdge: The entity is the subject of the edge whose predicate is *type* and whose object is a type.

Therefore, the structure of an entity fragment is as follows:

```java
<entity id> <InEntEdge list> | <OutEntEdge list> | <InEdge list> | <OutEdge list> | <Type list>
```

Between the entity id and the InEntEdge list, there should be a \t as the divider.

The InEntEdge list and OutEntEdge list have the form:

```java
<Subject or object entity id 1> : <Predicate id 1.1> ; <Predicate id 1.2> ; ...... , <Subject or object entity id 2> : <Predicate id 2.1> ; <Predicate id 2.2> ; ......
```

The InEdge, OutEdge and Type lists are similar but simpler:

```java
<Subject or object entity or type id 1> , <Subject or object entity or type id 2>, <Subject or object entity or type id 3>......
```

Let's go back to our example. For entity *StudentA*, the entity fragment should be:

```java
1 2:2;5 | 3:1 | 2 | 1,4 | 1
```

The id of *StudentA* is 1, so the entity fragment begins with 1. Then we find the InEntEdge, OutEntEdge, InEdge, OutEdge and Type lists one by one and append them to the entity fragment.
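
To make the layout concrete, the sketch below assembles one entity fragment line from lists that have already been collected. It only covers the serialization, not how the lists are found. The function name `format_entity_fragment` is hypothetical, and the spaces around `|` are an assumption copied from the printed example.

```python
def format_entity_fragment(entity_id, in_ent_edges, out_ent_edges,
                           in_edges, out_edges, types):
    """Serialize one entity fragment.

    in_ent_edges / out_ent_edges map a neighbor entity id to its predicate ids;
    in_edges, out_edges and types are plain lists of ids.
    """
    def ent_edge_list(edges):
        parts = []
        for neighbor, predicate_ids in edges.items():
            joined = ";".join(str(p) for p in predicate_ids)
            parts.append(f"{neighbor}:{joined}")
        return ",".join(parts)

    def id_list(ids):
        return ",".join(str(i) for i in ids)

    fields = [
        ent_edge_list(in_ent_edges),
        ent_edge_list(out_ent_edges),
        id_list(in_edges),
        id_list(out_edges),
        id_list(types),
    ]
    # A tab separates the entity id from the five lists.
    return f"{entity_id}\t" + " | ".join(fields)


student_a = format_entity_fragment(1, {2: [2, 5]}, {3: [1]}, [2], [1, 4], [1])
print(student_a)
```

Called with the lists from the *StudentA* example, the function reproduces the fragment line shown above, with a tab after the leading 1.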
### Step 6: Generate type fragments

Given a specific type, its type fragment contains three kinds of information: the predicate ids of the InEdges of entities of this type, the predicate ids of the OutEdges of entities of this type, and the ids of all entities of this type. The structure should be:

```java
<Type id> <InEdge predicate list> | <OutEdge predicate list> | <Entity list>
```

In our example, the type fragment of *Person* should be:

```java
1 2,5 | 1,4 | 1,2
```
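
Reading a type fragment back is a quick way to check that a generated file follows this structure. The parser below is a sketch with a hypothetical name, `parse_type_fragment`. It assumes the type id is separated from the lists by whitespace (a tab or a space) and that the three lists are divided by `|`.

```python
def parse_type_fragment(line):
    """Split one type fragment line into its id and three id lists."""
    type_id, body = line.strip().split(None, 1)
    fields = [
        # int() tolerates the spaces left around each number after splitting.
        [int(item) for item in field.split(",") if item.strip()]
        for field in body.split("|")
    ]
    in_predicates, out_predicates, entities = fields
    return {
        "type_id": int(type_id),
        "in_predicates": in_predicates,
        "out_predicates": out_predicates,
        "entities": entities,
    }


person = parse_type_fragment("1\t2,5 | 1,4 | 1,2")
print(person)
```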
### Step 7: Generate predicate fragments

Given a specific predicate, there can be more than one predicate fragment. Each predicate fragment comes from one triple. We record the types that a predicate may accept as subject or object. Sometimes the object is not an entity, and we use *literal* to denote this situation.

The structure of a predicate fragment is:

```java
[<Type list of the subject entity>] <predicate id> [<Type list of the object entity> or "literal"]
```

For predicate *friend_of*, the predicate fragment should be:

```java
[1] 2 [1]
```

For predicate *name*, the predicate fragment should be:

```java
[1] 4 literal
```

Please note that \t should be the divider between the type lists, the predicate id and "literal".
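
The sketch below derives predicate fragments from id triples such as those produced in Step 4. Several details are assumptions rather than documented behavior: the function name `predicate_fragments` is hypothetical, `type` triples themselves produce no fragment, identical fragments from different triples are written once, and multiple types inside the brackets are joined with commas.

```python
TYPE_PREDICATE_ID = 3  # id of <type> in the example id tables

ID_TRIPLES = [
    (1, 1, 3), (2, 2, 1), (2, 5, 1), (1, 4, -1),
    (2, 4, -1), (1, 3, 1), (2, 3, 1), (3, 3, 2),
]


def predicate_fragments(id_triples, type_predicate=TYPE_PREDICATE_ID):
    """Return tab-separated predicate fragments, one per distinct pattern."""
    # First pass: collect the type ids of every entity.
    types_of = {}
    for subject, predicate, obj in id_triples:
        if predicate == type_predicate:
            types_of.setdefault(subject, []).append(obj)

    def type_list(entity_id):
        return "[" + ",".join(str(t) for t in types_of.get(entity_id, [])) + "]"

    # Second pass: one fragment per non-type triple; the dict drops repeats
    # while keeping the order of first appearance.
    fragments = {}
    for subject, predicate, obj in id_triples:
        if predicate == type_predicate:
            continue
        object_part = "literal" if obj == -1 else type_list(obj)
        line = "\t".join([type_list(subject), str(predicate), object_part])
        fragments[line] = None
    return list(fragments)


result = predicate_fragments(ID_TRIPLES)
for fragment in result:
    print(fragment)
```

On the example data this yields the *friend_of* and *name* fragments shown above, plus one fragment each for *major* and *deskmate_of*.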
### Step 8: Rebuild the Lucene index for entity fragments and type short names

This is the final step to make gAnswer run on our new fragments. You can find the relevant code in src/lcn/BuildIndexForEntityFragments.java and src/lucene/BuildIndexForTypeShortName.java. All you need to do is import the project into Eclipse, modify the file paths in the relevant code, and then run the main function in each of these two files.

GAnswer system is a natural language QA system developed by Institute of Computer Science & Technology Data Management Lab, Peking University, led by Prof. Zou Lei. GAnswer is able to translate natural language questions to query graphs containing semant