Project author: SHRMu

Project description:
Suggestion of Context Sensitive Search Terms using Word and Sentence Embeddings
Programming language: Java
Project URL: git://github.com/SHRMu/Entity-Context-Based-Search-Suggestion.git


Entity-Context-Based Search Suggestion

TU Darmstadt Summer Term 2019 Data Management Project

Introduction

Elasticsearch is a very popular search engine built on Lucene. We believe, however, that plain keyword search will not be sufficient in the future.
We therefore integrate machine learning, in particular word embeddings, into a traditional search engine.

Implementation

  1. The Elasticsearch development framework was forked from https://github.com/panholly/esfilesearch.

  2. The TensorFlow model was trained and saved in Google Colaboratory, then loaded locally by Java code.

  3. The loaded model is used for entity suggestion; entity autocompletion is implemented on top of a trie (prefix tree) structure.
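The trie-based autocompletion in step 3 can be illustrated with a minimal sketch. The class name `EntityTrie` and its methods are hypothetical and are not taken from the project's source; the example only shows how a prefix tree returns the stored entity tokens that start with what the user has typed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Minimal prefix tree for entity autocompletion (illustrative, not the project's class). */
class EntityTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so completions come out in lexicographic order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isEntity;
    }

    private final Node root = new Node();

    /** Adds one entity token, e.g. "angela_merkel". */
    void insert(String entity) {
        Node node = root;
        for (char c : entity.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isEntity = true;
    }

    /** Returns up to limit stored entities that start with prefix. */
    List<String> complete(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return out; // no stored entity has this prefix
            }
        }
        collect(node, new StringBuilder(prefix), limit, out);
        return out;
    }

    // Depth-first walk below the prefix node, stopping once limit results are found.
    private void collect(Node node, StringBuilder path, int limit, List<String> out) {
        if (out.size() >= limit) {
            return;
        }
        if (node.isEntity) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, limit, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        EntityTrie trie = new EntityTrie();
        for (String e : new String[] {"angela_merkel", "alibaba_group", "barack_obama", "berlin"}) {
            trie.insert(e);
        }
        System.out.println(trie.complete("a", 10)); // [alibaba_group, angela_merkel]
        System.out.println(trie.complete("b", 1));  // [barack_obama]
    }
}
```

Lookup cost depends on the prefix length and the number of results returned, not on the total number of stored entities, which is why a trie suits type-ahead suggestion.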

Environment

Elasticsearch 7.2.0

MySQL 8.0.13

Entity Embedding

Rather than the single-word embeddings used in traditional NLP tasks, this project uses entity embeddings
for model training and prediction: each entity is a single token with its own vector.
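To make the idea of an entity token concrete, the sketch below turns a surface form such as "Donald Trump" into a single token such as `donald_trump`. The README does not describe the project's actual preprocessing; the rule used here (lowercase, drop every character outside `a-z`, `0-9` and space, then replace each space with an underscore) is an assumption inferred from token shapes in the results, such as `j_k_rowling` and `college_of_william__mary`. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Illustrative entity-token normalization; the rule is an assumption, not the project's code. */
class EntityTokens {
    static String toToken(String surfaceForm) {
        String lower = surfaceForm.toLowerCase(Locale.ROOT);
        // Drop punctuation and any non-ASCII characters, keeping letters, digits and spaces.
        String kept = lower.replaceAll("[^a-z0-9 ]", "");
        // Each remaining space becomes one underscore, so a removed "&" leaves a double underscore.
        return kept.trim().replace(' ', '_');
    }

    public static void main(String[] args) {
        System.out.println(toToken("Donald Trump"));              // donald_trump
        System.out.println(toToken("J. K. Rowling"));             // j_k_rowling
        System.out.println(toToken("College of William & Mary")); // college_of_william__mary
    }
}
```

Treating the whole name as one token means the embedding model learns one vector for the entity rather than separate vectors for "donald" and "trump".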

Result

Validation results after 50 epochs with windows_size = 10:

  • Nearest to donald_trump: toby_keith, yemen, superfund, eric_schmidt, max_rose, adam_goldman, united_states_office_of_special_counsel, appalachian_trail
  • Nearest to china: xi_jinping, central_military_commission, china_daily, lindsay_kemp, dandong, ashok_rajagopalan, rupert_brooke, forum_on_chinaafrica_cooperation
  • Nearest to barack_obama: melania_trump, bessie_coleman, werner_heisenberg, victor_trumper, howard_county, eileen_atkins, bobby_fischer, fayez_alsarraj
  • Nearest to angela_merkel: christian_democratic_union_of_germany, germany, schlumberger, danube, berlin, arab_world, wiesbaden, friedrich_merz
  • Nearest to harry_potter: j_k_rowling, shannon_hale, sheryl_crow, hogwarts, citizens_united_v_fec, ellen_muth, sofia, h_a_hellyer
  • Nearest to olympic_games: international_olympic_committee, toshir_mut, yuriko_koike, board_of_audit, bykada, american_banker, berkeley_heights, uur_erdener
  • Nearest to wikipedia: gerontology_research_group, konstantin_novoselov, wikipedia_community, battle_of_gettysburg, katherine_harris, college_of_william__mary, mv_tsgt_john_a_chapman, church_square
  • Nearest to alibaba_group: jack_ma, shanghai, claude_taylor, saeb_erekat, qingdao, national_retail_federation, domain_name_system, tokyo_stock_exchange
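"Nearest to" lists of this kind are commonly produced by ranking all other entities by the cosine similarity of their embedding vectors to the query entity. The README does not state which similarity measure the project uses, so cosine similarity is an assumption here. The class `NearestEntities` is hypothetical, and the two-dimensional vectors are made-up toy values, not trained embeddings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative nearest-neighbour lookup over entity vectors (toy data, assumed metric). */
class NearestEntities {
    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the k entities most similar to query, excluding the query itself. */
    static List<String> nearest(Map<String, float[]> vectors, String query, int k) {
        float[] q = vectors.get(query);
        if (q == null) {
            throw new IllegalArgumentException("unknown entity: " + query);
        }
        List<String> candidates = new ArrayList<>(vectors.keySet());
        candidates.remove(query);
        // Sort by descending similarity (negated so the largest value comes first).
        candidates.sort(Comparator.comparingDouble((String e) -> -cosine(q, vectors.get(e))));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        vectors.put("china", new float[] {1.0f, 0.0f});
        vectors.put("berlin", new float[] {0.0f, 1.0f});
        vectors.put("germany", new float[] {0.1f, 0.9f});
        vectors.put("xi_jinping", new float[] {0.9f, 0.1f});
        System.out.println("Nearest to china: " + nearest(vectors, "china", 2));
        // Nearest to china: [xi_jinping, germany]
    }
}
```

This brute-force scan is linear in the vocabulary size per query; for a large entity vocabulary an approximate nearest-neighbour index would be the usual replacement.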

Figure

50 example entity vectors

entity_vector_similarity

Website demo

website_demo