
step3_split.py 1.1 kB

# encoding=utf-8
'''
Step3: extract entities, types and predicates from the original triple file and allocate ids
'''
entities = set()
types = set()
predicate = set()

with open('triple file here', 'r') as f:
    i = 1  # counter of processed lines (lines with a literal object are not counted)
    k = 0  # number of lines that contain only two fields
    for line in f:
        # Drop the last two characters of the line, then split on tabs.
        tri = line[:-2].split('\t')
        entities.add(tri[0])
        predicate.add(tri[1])
        if len(tri) == 2:
            # The object field is missing: report the line and move on.
            print("%s:%d" % (line, i))
            i += 1
            k += 1
            print(tri)
            continue
        # Skip literal objects, i.e. values that start with a double quote.
        if tri[2].startswith('"'):
            continue
        entities.add(tri[2])
        # The object of a <type> triple is also recorded as a type.
        if tri[1] == '<type>':
            types.add(tri[2])
        if i % 10000 == 0:
            print(i)  # progress report
        i += 1
print(i)
print(k)


def write_ids(path, items):
    # Write one line per item: the item, a tab, then its id (ids start at 0).
    with open(path, 'w') as out:
        for idx, item in enumerate(items):
            out.write('%s\t%d\n' % (item.rstrip('\n'), idx))


write_ids('entity id file', entities)
write_ids('type id file', types)
write_ids('predicate id file', predicate)
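
The extraction and id allocation that step3_split.py performs can be tried out on a few in-memory lines before pointing the script at a real triple file. The following is a minimal sketch, not the project's own code: the helper names `split_triples` and `allocate_ids` are hypothetical, the sample lines assume each triple ends with a `.` followed by a newline (which would explain why the script drops the last two characters), lines with fewer than two fields are simply skipped, and items are sorted before numbering so that the ids are reproducible (the script itself numbers items in set iteration order).

```python
def split_triples(lines):
    """Collect entities, types and predicates from tab-separated triple lines."""
    entities, types, predicates = set(), set(), set()
    for line in lines:
        # Assumption: every line ends with a terminator character plus a newline.
        fields = line[:-2].split('\t')
        if len(fields) < 2:
            continue  # assumption: silently skip lines without a predicate
        entities.add(fields[0])
        predicates.add(fields[1])
        if len(fields) == 2:
            continue  # no object field
        obj = fields[2]
        if obj.startswith('"'):
            continue  # literal value, not an entity
        entities.add(obj)
        if fields[1] == '<type>':
            types.add(obj)
    return entities, types, predicates


def allocate_ids(items):
    """Map each item to a consecutive id starting at 0."""
    # Sorting is an assumption made here so that ids are stable across runs.
    return {item: idx for idx, item in enumerate(sorted(items))}


# Hypothetical sample data in the assumed "<s>\t<p>\t<o>.\n" layout.
sample = [
    '<Alice>\t<type>\t<Person>.\n',
    '<Alice>\t<knows>\t<Bob>.\n',
    '<Alice>\t<name>\t"Alice".\n',
]

entities, types, predicates = split_triples(sample)
entity_ids = allocate_ids(entities)
type_ids = allocate_ids(types)
predicate_ids = allocate_ids(predicates)
```

Here the object of the `<type>` triple ends up in both the entity and the type tables, while the quoted name is treated as a literal and receives no id.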

The GAnswer system is a natural language QA system developed by the Institute of Computer Science & Technology Data Management Lab, Peking University, led by Prof. Zou Lei. GAnswer is able to translate natural language questions to query graphs containing semant