
step2_dedubplicate.py 707 B

# encoding=utf-8
'''
Step2: remove the duplicate triples.
'''
triples = set()

# Collect every line from both input files; the set drops exact duplicates.
sources = ['./pkubase/pkubase-triples.txt', './pkubase/pkubase-types.txt']
for i, path in enumerate(sources, 1):
    with open(path, 'r') as f:
        for j, line in enumerate(f, 1):
            triples.add(line)
            # Progress report every 100,000 lines, as "file index:line count".
            if j % 100000 == 0:
                print("%d:%d" % (i, j))

print(len(triples))

# Write the unique lines; the with-block closes (and flushes) the output file.
with open('./pkubase/pkubase_clean.txt', 'w') as wf:
    for item in triples:
        wf.write(item)
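The script above writes the unique lines in arbitrary set order, and it treats a final line without a trailing newline as different from the same line with one. If either matters, a variant along the following lines may help. This is only a sketch, not part of the repository: the function name `dedupe_lines` and the tab-separated sample triples in the demo are hypothetical, and it assumes all unique lines still fit in memory.

```python
import os
import tempfile


def dedupe_lines(input_paths, output_path):
    """Copy each distinct line from input_paths to output_path once,
    in first-seen order, and return how many lines were written.

    Hypothetical helper for illustration; not part of the repository.
    """
    seen = set()
    kept = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path, "r") as src:
                for line in src:
                    # Normalise the line ending so a last line that lacks
                    # a trailing newline still matches its duplicates.
                    key = line.rstrip("\n")
                    if key not in seen:
                        seen.add(key)
                        out.write(key + "\n")
                        kept += 1
    return kept


if __name__ == "__main__":
    # Small demo on temporary files; the triple format here is assumed.
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.txt")
        b = os.path.join(tmp, "b.txt")
        with open(a, "w") as f:
            f.write("<s1>\t<p>\t<o1> .\n<s1>\t<p>\t<o1> .\n")
        with open(b, "w") as f:
            f.write("<s2>\t<p>\t<o2> .\n<s1>\t<p>\t<o1> .")
        out_path = os.path.join(tmp, "clean.txt")
        print(dedupe_lines([a, b], out_path))  # prints 2
```

Writing as lines are first seen keeps the output order stable between runs, which makes the cleaned file easier to diff against an earlier version.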

GAnswer is a natural language question answering (QA) system developed by the Data Management Lab at the Institute of Computer Science & Technology, Peking University, led by Prof. Zou Lei. GAnswer translates natural language questions into query graphs containing semant