## How to generate fragments out of your own triples
There are three kinds of fragments in gAnswer: entity fragments, predicate fragments and type fragments. They hold information extracted from the triples that helps gAnswer improve its results. This section shows how to generate your own fragments step by step, using a simple example.
### Step 1: Clean the triple files
Suppose we have a triple file containing only seven triples:
```java
<StudentA> <major> <computer_science>
<StudentB> <friend_of> <StudentA>
<StudentA> <name> "Jeff"
<StudentB> <name> "Tom"
<StudentA> <type> <Person>
<StudentB> <type> <Person>
<computer_science> <type> <Subject>
```
Generally speaking, each triple has three segments.
This is exactly the form of triples we need to generate fragments. However, sometimes the entities and predicates carry extra information. Take the DBpedia dataset as an example. The following is the original form of a DBpedia triple:
```java
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/property/demonym> <http://dbpedia.org/resource/Adjectivals_and_demonyms_for_U.S._states> .
```
As you can see, every entity and predicate is marked with a URI, but we don't need the URI prefixes. See Step1_clean_triples.py, which is the code we use to clean DBpedia triples.
In general, all you need to do in this step is make sure that the entity and predicate names are clear enough to indicate their meaning and contain no extra information.
By the way, if you have more than one triple file, please combine them into one so that the following steps are easier.
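
The cleaning described in this step can be sketched in a few lines of Python. This is not the content of Step1_clean_triples.py; it is a minimal illustration that assumes one triple per line, URIs wrapped in angle brackets, and an optional trailing ` .` terminator. The names `URI_RE` and `clean_triple_line` are hypothetical.

```python
import re

# Assumption: a URI is wrapped in <...> and its meaningful name is whatever
# follows the last '/' or '#'. Names without '/' or '#' are left untouched.
URI_RE = re.compile(r"<[^<>\s]*[/#]([^<>/#\s]+)>")


def clean_triple_line(line: str) -> str:
    """Drop the trailing ' .' and reduce every URI to its last path segment."""
    line = line.strip()
    if line.endswith("."):
        # Remove the statement terminator used by N-Triples style files.
        line = line[:-1].rstrip()
    return URI_RE.sub(r"<\1>", line)


raw = (
    "<http://dbpedia.org/resource/Alabama> "
    "<http://dbpedia.org/property/demonym> "
    "<http://dbpedia.org/resource/Adjectivals_and_demonyms_for_U.S._states> ."
)
print(clean_triple_line(raw))
# <Alabama> <demonym> <Adjectivals_and_demonyms_for_U.S._states>
```

A real dataset may need more care, for example when two different prefixes share the same last segment, so treat the regular expression as a starting point only.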
### Step 2: Remove duplicate triples
One triple may occur more than once in the clean triple file, especially when you combine many triple files into one.
gAnswer tolerates duplicate triples, but they hurt its performance, so it is worth removing them.
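
One way to remove duplicates is to keep only the first occurrence of each line. The sketch below assumes the cleaned triples fit in memory and that two triples are duplicates only when their lines are identical after trimming whitespace; `dedupe_lines` is a hypothetical helper name.

```python
def dedupe_lines(lines):
    """Return the non-empty lines with exact duplicates removed, in first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


triples = [
    "<StudentA> <type> <Person>",
    "<StudentB> <type> <Person>",
    "<StudentA> <type> <Person>",  # duplicate of the first line
]
print(dedupe_lines(triples))
# ['<StudentA> <type> <Person>', '<StudentB> <type> <Person>']
```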
### Step 3: Extract entity, predicate and type names for id allocation
To save space, the fragment files are built not from the entity, predicate and type names themselves but from their ids. Therefore, we must extract every entity, predicate and type name from the triple file and give each one a unique id. In our example, the id files look like this:
```java
// entity ids
<StudentA> 1
<StudentB> 2
<computer_science> 3
// predicate ids
<major> 1
<friend_of> 2
<type> 3
<name> 4
// type ids
<Person> 1
<Subject> 2
```
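
A possible way to allocate the ids is sketched below. It rests on assumptions that are not stated in the text: the type predicate is literally `<type>`, the object of a `<type>` triple is always a type, any other object wrapped in angle brackets is an entity, and everything else is a literal. It numbers names in order of first appearance, so the concrete numbers differ from the listing above (for example, `<name>` gets 3 and `<type>` gets 4); only uniqueness matters. `parse_triple` and `allocate_ids` are hypothetical names.

```python
def parse_triple(line):
    """Split a cleaned triple line into (subject, predicate, object)."""
    subject, predicate, obj = line.strip().split(" ", 2)
    return subject, predicate, obj


def allocate_ids(lines, type_predicate="<type>"):
    """Return three name -> id tables: entities, predicates, types."""
    entity_ids, predicate_ids, type_ids = {}, {}, {}

    def assign(table, name):
        # Keep an existing id; otherwise use the next free one (ids start at 1).
        table.setdefault(name, len(table) + 1)

    for line in lines:
        subject, predicate, obj = parse_triple(line)
        assign(entity_ids, subject)
        assign(predicate_ids, predicate)
        if predicate == type_predicate:
            assign(type_ids, obj)
        elif obj.startswith("<"):
            assign(entity_ids, obj)
        # Anything else (quoted strings, numbers) is a literal and gets no id.
    return entity_ids, predicate_ids, type_ids


TRIPLES = [
    "<StudentA> <major> <computer_science>",
    "<StudentB> <friend_of> <StudentA>",
    '<StudentA> <name> "Jeff"',
    '<StudentB> <name> "Tom"',
    "<StudentA> <type> <Person>",
    "<StudentB> <type> <Person>",
    "<computer_science> <type> <Subject>",
]
entities, predicates, types = allocate_ids(TRIPLES)
print(types)
# {'<Person>': 1, '<Subject>': 2}
```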
### Step 4: Represent triples with ids
For convenience, before generating the fragments, we first replace all the name strings in the triple file with their corresponding ids.
In our example, the new triple file looks like this:
```java
1 1 3
2 2 1
1 4 -1
2 4 -1
1 3 1
2 3 1
3 3 2
```
Notice that we use -1 to represent values that are neither entities nor types, such as numbers and other literals.
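
The replacement can be sketched as follows, using the id tables from the Step 3 example. The sketch assumes that the object of a `<type>` triple is looked up in the type table, every other object in the entity table, and that any object missing from the entity table is a literal. The constant and function names are hypothetical.

```python
# Id tables copied from the Step 3 example.
ENTITY_IDS = {"<StudentA>": 1, "<StudentB>": 2, "<computer_science>": 3}
PREDICATE_IDS = {"<major>": 1, "<friend_of>": 2, "<type>": 3, "<name>": 4}
TYPE_IDS = {"<Person>": 1, "<Subject>": 2}
LITERAL_ID = -1  # marks objects that are neither entities nor types


def encode_triple(line, type_predicate="<type>"):
    """Rewrite one cleaned triple line as 'subject_id predicate_id object_id'."""
    subject, predicate, obj = line.strip().split(" ", 2)
    if predicate == type_predicate:
        obj_id = TYPE_IDS[obj]
    else:
        obj_id = ENTITY_IDS.get(obj, LITERAL_ID)
    return f"{ENTITY_IDS[subject]} {PREDICATE_IDS[predicate]} {obj_id}"


TRIPLES = [
    "<StudentA> <major> <computer_science>",
    "<StudentB> <friend_of> <StudentA>",
    '<StudentA> <name> "Jeff"',
    '<StudentB> <name> "Tom"',
    "<StudentA> <type> <Person>",
    "<StudentB> <type> <Person>",
    "<computer_science> <type> <Subject>",
]
for triple in TRIPLES:
    print(encode_triple(triple))
```

Run on the seven example triples, this prints the same seven id lines as the listing above.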
### Step 5: Generate entity fragments
Finally, we can generate the entity fragments. Every entity has its own fragment. A fragment holds information about the edges connected to the entity as well as its neighboring entities. First, let's clarify the idea of subject and object in a triple. A triple consists of three parts: subject, predicate and object. For example:
```java
<StudentA> <major> <computer_science>
```
Here *StudentA* is the subject, *major* is the predicate and *computer_science* is the object. In general, the first element is the subject, the second is the predicate and the third is the object. The object is not always an entity or a type: values such as numbers and strings can also be objects.
We define 5 kinds of edges:
1. InEntEdge: The entity is the object of the edge and the subject is also an entity.
2. OutEntEdge: The entity is the subject of the edge and the object is also an entity.
3. InEdge: The entity is the object of the edge.
4. OutEdge: The entity is the subject of the edge.
5. typeEdge: The entity is the subject of the edge whose predicate is *type* and its object is a type.
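
To make the five definitions concrete, the sketch below groups the edges of each entity from the id triples of Step 4. The dictionary layout is purely illustrative and is not gAnswer's fragment file format. Two assumptions are made: the id of `<type>` is 3, as in the example id file, and a type edge is also counted as an OutEdge, which follows the definition literally but is not confirmed by the text. `collect_edges` is a hypothetical name.

```python
from collections import defaultdict

TYPE_PREDICATE_ID = 3  # id of <type> in the example id file
LITERAL_ID = -1        # object that is neither an entity nor a type


def collect_edges(id_triples):
    """Group the five edge kinds per entity id from (subject, predicate, object) ids."""
    edges = defaultdict(lambda: {
        "InEntEdge": [], "OutEntEdge": [], "InEdge": [], "OutEdge": [], "typeEdge": [],
    })
    for s, p, o in id_triples:
        # The subject of any triple has an outgoing edge with predicate p.
        edges[s]["OutEdge"].append(p)
        if p == TYPE_PREDICATE_ID:
            edges[s]["typeEdge"].append(o)  # o is a type id here
        elif o != LITERAL_ID:
            # Both ends are entities: record the neighbor together with the predicate.
            edges[s]["OutEntEdge"].append((o, p))
            edges[o]["InEdge"].append(p)
            edges[o]["InEntEdge"].append((s, p))
    return edges


ID_TRIPLES = [(1, 1, 3), (2, 2, 1), (1, 4, -1), (2, 4, -1), (1, 3, 1), (2, 3, 1), (3, 3, 2)]
fragments = collect_edges(ID_TRIPLES)
print(fragments[1])
# {'InEntEdge': [(2, 2)], 'OutEntEdge': [(3, 1)], 'InEdge': [2], 'OutEdge': [1, 4, 3], 'typeEdge': [1]}
```

For entity 1 (*StudentA*), the result says that entity 2 points to it through predicate 2 (*friend_of*), that it points to entity 3 through predicate 1 (*major*), and that it has type 1 (*Person*).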

The GAnswer system is a natural language QA system developed by the Institute of Computer Science & Technology Data Management Lab, Peking University, led by Prof. Zou Lei. GAnswer is able to translate natural language questions to query graphs containing semant