Fast, flexible name matching for large datasets
Recommended install via pip
pip install git+https://github.com/bradhackinen/nama.git@master
Install from source with conda
Install Anaconda
Clone nama
git clone https://github.com/bradhackinen/nama.git
Enter the conda directory, where the conda environment file is located, with
cd conda
Create a new conda environment with
conda create --name <env-name>
Activate the new environment with
conda activate <env-name>
Download pytorch-mutex and install it with
conda install pytorch-mutex-1.0-cuda.tar.bz2
Download pytorch and install it with
conda install pytorch-1.10.2-py3.9_cuda11.3_cudnn8.2.0_0.tar.bz2
Install the rest of the dependencies with
conda install --file conda_env.txt
Exit the conda directory with
cd ..
Install the package with
pip install .
Install from source with pip
Clone nama with
git clone https://github.com/bradhackinen/nama.git
Create and activate a virtual environment with
python -m venv nama_env && source nama_env/bin/activate
Install the dependencies with
pip install -r requirements.txt
Then install the package with
pip install ./nama
or, from the project root, with
pip install .
or
pip install /path-to-project-root
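Whichever install route is used, a quick sanity check (a minimal sketch that simply confirms the package imports) is:
# sanity check: should run without an ImportError if nama installed correctly
import nama
print(nama.__file__)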
Matcher()
To import data into the matcher, we can either pass nama a pandas DataFrame with
import nama
training_data = nama.from_df(
df,
group_column='group_column',
string_column='string_column')
print(training_data)
or we can pass nama a .csv file directly with
import nama
testing_data = nama.read_csv(
'path-to-data',
match_format=match_format,
group_column=group_column,
string_column=string_column)
print(testing_data)
See from_df & read_csv for parameters and function details.
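As an illustration, here is a toy DataFrame passed through from_df; the data and the placeholder column names 'group_column' and 'string_column' are made up for this sketch:
import pandas as pd
import nama

# toy data: three name strings assigned to two groups
df = pd.DataFrame({
    'string_column': ['ACME Inc.', 'ACME Incorporated', 'XYZ Corp'],
    'group_column': ['acme', 'acme', 'xyz']})

training_data = nama.from_df(
    df,
    group_column='group_column',
    string_column='string_column')
print(training_data)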
EmbeddingSimilarityModel()
We can initialize a model like so
from nama.embedding_similarity import EmbeddingSimilarityModel
sim = EmbeddingSimilarityModel()
If using a GPU, we need to send the model to a GPU device like so
sim.to(gpu_device)
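where gpu_device is a torch device that can be selected like so (a minimal sketch, falling back to the CPU when no GPU is available):
import torch

# pick the first CUDA device if available, otherwise fall back to the CPU
gpu_device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')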
To train a model, we simply need to specify the training parameters and training data
train_kwargs = {
'max_epochs': 1,
'warmup_frac': 0.2,
'transformer_lr':1e-5,
'score_lr':30,
'use_counts':False,
'batch_size':8,
'early_stopping':False
}
history_df, val_df = sim.train(training_data, verbose=True, **train_kwargs)
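Assuming history_df and val_df are pandas DataFrames, as their names suggest, the training history and validation results can be inspected afterwards:
# peek at the last few rows of the training history and validation output
print(history_df.tail())
print(val_df.tail())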
We can also save the trained model for later
sim.save("path-to-save-model")
We can use the model we trained above directly like so
embeddings = sim.embed(testing_data)
Or load a previously trained model
from nama.embedding_similarity import load_similarity_model
new_sim = load_similarity_model("path-to-saved-model")
embeddings = new_sim.embed(testing_data)
MORE TO COME