Document search software built on an inverted-index data structure.
In this project, a given library folder containing txt files is scanned so that a word and its number of occurrences
can be looked up across multiple txt files.
An inverted index structure is used for storing the data. In this structure, each word is parsed and then stored together
with the paths of the txt files that contain it and the number of occurrences of the word in each txt file.
The following is a sample of this structure in tabular form:
Word      Files (number of occurrences)
Name      animals/cats/doc1.txt (2),
          ceng/linux/centos.txt (1),
          history/asia/china.txt (5)
Genius    Science/chemistry/marie_curie.txt (6),
          Science/physics/heisenberg.txt (5)
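The structure in the table can be sketched in C++ as a map from each word to a map of file paths and counts. This is a minimal sketch under stated assumptions, not the project's actual implementation: the names InvertedIndex, addDocument and lookup are hypothetical, words are assumed to be whitespace-separated and case-sensitive, and results are assumed to be ordered from the highest count to the lowest.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// word -> (file path -> number of occurrences in that file)
using InvertedIndex = std::map<std::string, std::map<std::string, int>>;

// Count every whitespace-separated word of `text` under `path`.
// Assumption: the real parsing rules (punctuation, case) are not known.
void addDocument(InvertedIndex& index, const std::string& path,
                 const std::string& text) {
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++index[word][path];
    }
}

// Return (path, count) pairs for `word`, highest count first (assumed order).
// An unknown word yields an empty result.
std::vector<std::pair<std::string, int>> lookup(const InvertedIndex& index,
                                                const std::string& word) {
    std::vector<std::pair<std::string, int>> hits;
    auto it = index.find(word);
    if (it == index.end()) {
        return hits;
    }
    hits.assign(it->second.begin(), it->second.end());
    // stable_sort keeps ties in alphabetical path order (the map's order).
    std::stable_sort(hits.begin(), hits.end(),
                     [](const auto& a, const auto& b) {
                         return a.second > b.second;
                     });
    return hits;
}
```

A lookup costs one map search plus a sort of that word's file list, so the query time does not depend on the total size of the library.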
After indexing, the user enters a word query, and the entry for that word is retrieved and printed. In this output,
paths are ordered by number of occurrences. This project consists of the following modules:
The requirements are as follows:
First, a C++17-compliant g++ toolchain must be installed. This project uses g++-8. You can install it by typing the following commands:
If the last command prints the expected output, the installation is complete.
Then make must be installed. This project uses GNU Make 4.1, which is installed by the commands above.
Finally, install cmake by going to the following link and following the README file there:
https://cmake.org/download/
After completing all the steps, download googletest from the following link:
https://github.com/google/googletest/releases/tag/release-1.10.0
Extract googletest, go into its directory, and type the following commands:
If the commands produce the expected output, googletest is installed properly.
Then clone the repository of the Document Search project, go into the folder, and create the build dir by typing
and go into it by typing
Then run the following command to generate the build files from CMakeLists.txt:
If it completes properly, run the following command to build the project:
In the current directory, build, you can use the following command-line utility to index the library folder:
You can replace the library path with your own custom path. Be careful with the slash character between directory names.
On Linux it is / and on Windows it is \, which is an escape character, so you must double each backslash, like home\\directory\\
After indexing the sample library, you can query any word by replacing wordToSearch with your own:
To run all the tests, go into the test dir inside build by typing
and then type
To run the project on Windows, follow these instructions, which have been tested:
MSYS2 must be installed by following the link below:
After the first step is done, install MinGW (gcc), make and cmake from the setup
directory of MSYS2, which is named msys64 on a 64-bit PC.
In that folder, run msys2.exe to open its command line.
There you must enter a few commands to prepare your environment. The
commands are as follows:
On a 64-bit computer, install gcc with:
pacman -S mingw-w64-x86_64-gcc
Then, to install make:
pacman -S mingw64/mingw-w64-x86_64-make
Then, to install cmake:
pacman -S mingw-w64-x86_64-cmake
After all the tools are installed, download googletest from the link below, which is the latest version
at the time of writing:
https://github.com/google/googletest/releases/tag/release-1.10.0
After downloading googletest, extract it into the directory msys64->home->username.
Then close the MSYS command line, go back to the msys64 directory, and click mingw64 (or mingw32) to open
its command line.
Go into the googletest dir by typing -> cd googletest….. directory. Then create the build dir by typing
mkdir build and go into build by typing cd build.
After that, run the following commands to build googletest:
After a successful build, clone the Document Search project by running git clone with the repository link.
Enter the downloaded directory and run the following commands:
The following commands must be entered in the build directory:
To index the library, type:
./main.exe -index ../library
To print the information stored for any word in the files, type:
./main.exe -search word
For testing, enter the test folder in build with:
cd test/
and then run the exe file with:
./text_main.exe
and all the tests will run.