Project author: bradhackinen

Project description: Fast, flexible name matching for large datasets
Language: Python
Project URL: git://github.com/bradhackinen/nama.git
Created: 2018-06-22T20:28:41Z
Project community: https://github.com/bradhackinen/nama

License: GNU General Public License v3.0



NAMA: The NAme MAtching tool

Fast, flexible name matching for large datasets

Installation

Recommended install via pip

  1. (Optional) Create & activate a virtual environment, e.g. python -m venv nama_env && source nama_env/bin/activate
  2. Install nama with pip install git+https://github.com/bradhackinen/nama.git@master
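
After installing, a quick sanity check (a minimal sketch; both imports appear later in this README) confirms the package is importable:

    import nama
    from nama.embedding_similarity import EmbeddingSimilarityModel

    # If both imports succeed, the install worked.
    print(nama.__name__)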

Install from source with conda

  1. Install Anaconda

  2. Clone nama with

       git clone https://github.com/bradhackinen/nama.git

  3. Enter the conda directory, where the conda environment file is, with

       cd conda

  4. Create a new conda environment with

       conda create --name <env-name>

  5. Activate the new environment with

       conda activate <env-name>

  6. Download & install pytorch-mutex with

       conda install pytorch-mutex-1.0-cuda.tar.bz2

  7. Download & install pytorch with

       conda install pytorch-1.10.2-py3.9_cuda11.3_cudnn8.2.0_0.tar.bz2

  8. Install the rest of the dependencies with

       conda install --file conda_env.txt

  9. Exit the conda directory with

       cd ..

  10. Install the package with

       pip install .

Installing from source with pip

  1. Clone nama with git clone https://github.com/bradhackinen/nama.git
  2. Create & activate a virtual environment with python -m venv nama_env && source nama_env/bin/activate
  3. Install dependencies with pip install -r requirements.txt
  4. Install the package with pip:
     • from the project root directory: pip install .
     • from another directory: pip install /path-to-project-root (e.g. pip install ./nama from the directory containing the clone)

Demo

Usage

Using the Matcher()

Importing data

To import data into the matcher we can either pass nama a pandas DataFrame with

    import nama

    training_data = nama.from_df(
        df,
        group_column='group_column',
        string_column='string_column')
    print(training_data)

or we can pass nama a .csv file directly

    import nama

    testing_data = nama.read_csv(
        'path-to-data',
        match_format=match_format,
        group_column=group_column,
        string_column=string_column)
    print(testing_data)

See from_df & read_csv for parameters and function details
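
As a concrete illustration of the DataFrame route, here is a minimal sketch that builds a toy DataFrame and passes it to from_df (the example rows and the column names 'group_column' and 'string_column' are placeholders, not part of nama):

    import pandas as pd
    import nama

    # Toy data: each row pairs a raw name string with the group it belongs to.
    df = pd.DataFrame({
        'string_column': ['ACME Inc.', 'ACME Incorporated', 'Widget Co'],
        'group_column': ['acme', 'acme', 'widget'],
        })

    training_data = nama.from_df(
        df,
        group_column='group_column',
        string_column='string_column')
    print(training_data)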

Using the EmbeddingSimilarityModel()

Initialization

We can initialize a model like so:

    from nama.embedding_similarity import EmbeddingSimilarityModel

    sim = EmbeddingSimilarityModel()

If using a GPU, we need to send the model to the GPU device:

    sim.to(gpu_device)
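
Here, gpu_device can be a standard PyTorch device object; a minimal sketch of obtaining one (assuming PyTorch is installed, as in the conda instructions above, and falling back to CPU when no GPU is present):

    import torch

    # Use the first CUDA GPU if available, otherwise fall back to the CPU.
    gpu_device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    sim.to(gpu_device)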

Training

To train a model, we simply need to specify the training parameters and training data:

    train_kwargs = {
        'max_epochs': 1,
        'warmup_frac': 0.2,
        'transformer_lr': 1e-5,
        'score_lr': 30,
        'use_counts': False,
        'batch_size': 8,
        'early_stopping': False
        }

    history_df, val_df = sim.train(training_data, verbose=True, **train_kwargs)
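
The _df suffix suggests that history_df and val_df are pandas DataFrames recording training progress and validation results; assuming that is the case, a quick way to inspect them is:

    # Inspect the last few rows of the training history and the validation results.
    print(history_df.tail())
    print(val_df.head())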

We can also save the trained model for later use:

    sim.save("path-to-save-model")

Testing

We can use the model we trained above directly:

    embeddings = sim.embed(testing_data)

Or load a previously trained model:

    from nama.embedding_similarity import load_similarity_model

    new_sim = load_similarity_model("path-to-saved-model")
    embeddings = new_sim.embed(testing_data)

MORE TO COME