Project author: bradhackinen

Project description: Fast, flexible name matching for large datasets
Language: Python
Project URL: git://github.com/bradhackinen/nama.git
Created: 2018-06-22T20:28:41Z
Project community: https://github.com/bradhackinen/nama

License: GNU General Public License v3.0



NAMA: The NAme MAtching tool

Fast, flexible name matching for large datasets

Installation

Recommended install via pip

  1. (Optional) Create & activate a virtual environment, e.g. python -m venv nama_env && source nama_env/bin/activate
  2. Install nama with pip install git+https://github.com/bradhackinen/nama.git@master
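
After installing, a quick sanity check (a minimal sketch; both imports appear later in this README) confirms the package is importable:

    import nama
    from nama.embedding_similarity import EmbeddingSimilarityModel

    # If both imports succeed, the install worked.
    print(nama.__name__)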

Install from source with conda

  1. Install Anaconda

  2. Clone nama with

       git clone https://github.com/bradhackinen/nama.git

  3. Enter the conda directory, where the conda environment file is, with

       cd conda

  4. Create a new conda environment with

       conda create --name <env-name>

  5. Activate the new environment with

       conda activate <env-name>

  6. Download & install pytorch-mutex with

       conda install pytorch-mutex-1.0-cuda.tar.bz2

  7. Download & install pytorch with

       conda install pytorch-1.10.2-py3.9_cuda11.3_cudnn8.2.0_0.tar.bz2

  8. Install the rest of the dependencies with

       conda install --file conda_env.txt

  9. Exit the conda directory with

       cd ..

  10. Install the package with

       pip install .

Installing from source with pip

  1. Clone nama with git clone https://github.com/bradhackinen/nama.git
  2. Create & activate a virtual environment with python -m venv nama_env && source nama_env/bin/activate
  3. Install dependencies with pip install -r requirements.txt
  4. Install the package with pip:
     • from the project root directory: pip install .
     • from another directory: pip install /path-to-project-root (e.g. pip install ./nama from the directory containing the clone)

Demo

Usage

Using the Matcher()

Importing data

To import data into the matcher we can either pass nama a pandas DataFrame with

    import nama

    training_data = nama.from_df(
        df,
        group_column='group_column',
        string_column='string_column')
    print(training_data)

or we can pass nama a .csv file directly

    import nama

    testing_data = nama.read_csv(
        'path-to-data',
        match_format=match_format,
        group_column=group_column,
        string_column=string_column)
    print(testing_data)

See from_df & read_csv for parameters and function details
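
As a concrete illustration of the DataFrame route, here is a minimal sketch that builds a toy DataFrame and passes it to from_df (the example rows and the column names 'group_column' and 'string_column' are placeholders, not part of nama):

    import pandas as pd
    import nama

    # Toy data: each row pairs a raw name string with the group it belongs to.
    df = pd.DataFrame({
        'string_column': ['ACME Inc.', 'ACME Incorporated', 'Widget Co'],
        'group_column': ['acme', 'acme', 'widget'],
        })

    training_data = nama.from_df(
        df,
        group_column='group_column',
        string_column='string_column')
    print(training_data)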

Using the EmbeddingSimilarityModel()

Initialization

We can initialize a model like so:

    from nama.embedding_similarity import EmbeddingSimilarityModel

    sim = EmbeddingSimilarityModel()

If using a GPU, we need to send the model to the GPU device:

    sim.to(gpu_device)
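
Here, gpu_device can be a standard PyTorch device object; a minimal sketch of obtaining one (assuming PyTorch is installed, as in the conda instructions above, and falling back to CPU when no GPU is present):

    import torch

    # Use the first CUDA GPU if available, otherwise fall back to the CPU.
    gpu_device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    sim.to(gpu_device)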

Training

To train a model, we simply need to specify the training parameters and training data:

    train_kwargs = {
        'max_epochs': 1,
        'warmup_frac': 0.2,
        'transformer_lr': 1e-5,
        'score_lr': 30,
        'use_counts': False,
        'batch_size': 8,
        'early_stopping': False
        }

    history_df, val_df = sim.train(training_data, verbose=True, **train_kwargs)
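
The _df suffix suggests that history_df and val_df are pandas DataFrames recording training progress and validation results; assuming that is the case, a quick way to inspect them is:

    # Inspect the last few rows of the training history and the validation results.
    print(history_df.tail())
    print(val_df.head())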

We can also save the trained model for later use:

    sim.save("path-to-save-model")

Testing

We can use the model we trained above directly:

    embeddings = sim.embed(testing_data)

Or load a previously trained model:

    from nama.embedding_similarity import load_similarity_model

    new_sim = load_similarity_model("path-to-saved-model")
    embeddings = new_sim.embed(testing_data)

MORE TO COME