Multi-hop knowledge graph reasoning learned via policy gradient with reward shaping and action dropout
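
The three ingredients named above can be illustrated with a minimal, framework-free sketch. Everything here is an assumption made for illustration, not this project's actual code: the function names (`shaped_reward`, `action_dropout`, `reinforce_loss`), the `soft_score` input (standing in for a pretrained embedding model's plausibility score in [0, 1]), and the specific formulas are one common formulation. Reward shaping replaces a zero reward with the soft score when the reached entity is not a known answer. Action dropout randomly masks outgoing edges before sampling to encourage exploration. The policy-gradient (REINFORCE) loss weights the path's log-probability by the terminal reward.

```python
import random

def shaped_reward(is_correct, soft_score):
    """Reward shaping (assumed form): R = R_b + (1 - R_b) * soft_score.

    R_b is the binary hit reward; soft_score is a hypothetical
    embedding-based plausibility estimate in [0, 1].
    """
    r_b = 1.0 if is_correct else 0.0
    return r_b + (1.0 - r_b) * soft_score

def action_dropout(probs, drop_rate, rng, eps=1e-20):
    """Action dropout (assumed form): mask each action with probability
    drop_rate, add a tiny eps so the distribution never collapses to
    all zeros, then renormalize."""
    mask = [0.0 if rng.random() < drop_rate else 1.0 for _ in probs]
    perturbed = [p * m + eps for p, m in zip(probs, mask)]
    total = sum(perturbed)
    return [p / total for p in perturbed]

def sample_action(probs, rng):
    """Sample an action index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def reinforce_loss(log_probs, reward):
    """REINFORCE loss for one rollout: -(terminal reward) * sum of the
    log-probabilities of the actions taken along the path."""
    return -reward * sum(log_probs)
```

In a training loop under these assumptions, the agent would sample each hop from `action_dropout(policy_probs, drop_rate, rng)`, while the gradient would still be taken through the log-probabilities of the unperturbed policy. At evaluation time the dropout rate would be set to zero.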