Project author: noahgift

Project description:
Data Engineering: Chapter 5 AWS chapter for Pragmatic AI. Creates a "real world" Data Engineering API using Flask, Click, Pandas and Swagger docs.
Language: Python
Project address: git://github.com/noahgift/pai-aws.git
Created: 2018-03-16T18:08:57Z
Project community: https://github.com/noahgift/pai-aws

Open source license:

MLOps Python Cookbook with GitHub Actions

Data Engineering API Example

An example project that shows how to create a Data Engineering API around Flask and Pandas:

Data teams often need to build libraries and services to make it easier to work with data on the platform. In this example, there is a need to create a proof-of-concept aggregation of csv data: a REST API that accepts a csv, a column to group on, and a column to aggregate, and returns the result.

Note: this project is a chapter in the book Pragmatic AI; the entire project's source can be found here.

Using the default web app.

The Swagger API has some pretty powerful tools built in.

  • To list the plugins that are loaded:

Plugins

  • To apply one of those functions:

Swagger API

Sample Input

```
first_name,last_name,count
chuck,norris,10
kristen,norris,17
john,lee,3
sam,mcgregor,15
john,mcgregor,19
```

Sample Output

```
norris,27
lee,3
mcgregor,34
```
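
As a sanity check, the same aggregation can be reproduced in a few lines of Pandas. This is a minimal sketch of the logic the service wraps, not the project's actual code; the `aggregate_csv` name is illustrative:

```python
# Minimal sketch: group a csv on one column and sum another.
# Names here are illustrative, not the project's internals.
from io import StringIO

import pandas as pd

CSV = """first_name,last_name,count
chuck,norris,10
kristen,norris,17
john,lee,3
sam,mcgregor,15
john,mcgregor,19
"""

def aggregate_csv(csv_text, group_by, column, func="sum"):
    """Group csv_text on group_by and aggregate column with func."""
    df = pd.read_csv(StringIO(csv_text))
    return df.groupby(group_by)[column].agg(func)

print(aggregate_csv(CSV, group_by="last_name", column="count"))
# last_name
# lee          3
# mcgregor    34
# norris      27
# Name: count, dtype: int64
```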

How to run the example and set up the environment:

To create the environment (tested on OS X 10.12.5), run make setup, which runs the following command:

```
mkdir -p ~/.pai-aws && python3 -m venv ~/.pai-aws
```

Then source the virtualenv. Typically I do this by adding an alias to my .zshrc:

```
alias ntop="cd ~/src/pai-aws && source ~/.pai-aws/bin/activate"
```

I can then type ntop to cd into my checkout and source the virtualenv. Next, I make sure I have the latest packages and that linting and tests pass by running make all:

```
make all
```

I also like to verify that pylint, pytest, and python are exactly the versions I expect, so I added a make target, env, to conveniently check them:

```
(.pai-aws) ➜ pai-aws git:(master) ✗ make env
Show information about environment
which python3
/Users/noahgift/.pai-aws/bin/python3
python3 --version
Python 3.6.1
which pytest
/Users/noahgift/.pai-aws/bin/pytest
which pylint
/Users/noahgift/.pai-aws/bin/pylint
```

How to interact with the command-line tool (Click framework):

Check version:

```
(.pai-aws) ➜ pai-aws git:(master) ✗ ./csvutil.py --version
csvutil.py, version 0.1
```

Check help:

```
(.pai-aws) ➜ pai-aws git:(master) ✗ ./csvutil.py --help
Usage: csvutil.py [OPTIONS] COMMAND [ARGS]...

  CSV Operations Tool

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.
```

Get the median. Example usage:

```
./csvcli.py cvsops --file ext/input.csv --groupby last_name --applyname count --func npmedian
Processing csvfile: ext/input.csv and groupby name: last_name and applyname: count
2017-06-22 14:07:52,532 - nlib.utils - INFO - Loading appliable functions/plugins: npmedian
2017-06-22 14:07:52,533 - nlib.utils - INFO - Loading appliable functions/plugins: npsum
2017-06-22 14:07:52,533 - nlib.utils - INFO - Loading appliable functions/plugins: numpy
2017-06-22 14:07:52,533 - nlib.utils - INFO - Loading appliable functions/plugins: tanimoto
last_name
eagle    17.0
lee       3.0
smith    13.5
Name: count, dtype: float64
```
Testing a bigger file than the assignment:

```
./csvcli.py cvsops --file ext/large_input.csv --groupby first_name --applyname count --func npmedian
Processing csvfile: ext/large_input.csv and groupby name: first_name and applyname: count
2021-03-22 12:36:07,677 - nlib.utils - INFO - Loading appliable functions/plugins: npmedian
2021-03-22 12:36:07,677 - nlib.utils - INFO - Loading appliable functions/plugins: npsum
2021-03-22 12:36:07,677 - nlib.utils - INFO - Loading appliable functions/plugins: numpy
2021-03-22 12:36:07,677 - nlib.utils - INFO - Loading appliable functions/plugins: tanimoto
first_name
john       11.0
kristen    17.0
piers      10.0
sam        15.0
Name: count, dtype: float64
```
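
For reference, here is a hypothetical sketch of how a Click group with a cvsops subcommand like the one above could be wired up. The option names mirror the usage shown, but the real csvcli.py and its nlib plugin loader may differ:

```python
#!/usr/bin/env python
# Hypothetical sketch of a Click CLI resembling csvcli.py; the real tool's
# internals (and its plugin discovery in nlib.utils) may differ.
import click
import numpy as np
import pandas as pd

@click.group()
@click.version_option("0.1")
def cli():
    """CSV Operations Tool"""

@cli.command()
@click.option("--file", "csvfile", required=True, help="Path to a csv file")
@click.option("--groupby", required=True, help="Column to group on")
@click.option("--applyname", required=True, help="Column to aggregate")
@click.option("--func", default="npmedian", help="Aggregation plugin to apply")
def cvsops(csvfile, groupby, applyname, func):
    """Group a csv file and apply an aggregation function."""
    click.echo(f"Processing csvfile: {csvfile} and groupby name: {groupby} "
               f"and applyname: {applyname}")
    plugins = {"npmedian": np.median, "npsum": np.sum}  # stand-in plugin registry
    df = pd.read_csv(csvfile)
    click.echo(df.groupby(groupby)[applyname].apply(plugins[func]))

if __name__ == "__main__":
    cli()
```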

How to run the web app (primary question) and use the API:

To run the flask api (if you have followed the instructions above), you should be able to run `make start-api`. The output should look like this:

```
(.pai-aws) ➜ pai-aws git:(master) ✗ make start-api
sets PYTHONPATH to directory above, would do differently in production
cd flaskapp && PYTHONPATH=".." python web.py
2017-06-17 16:34:15,049 - __main__ - INFO - START Flask
 * Running on http://0.0.0.0:5001/ (Press CTRL+C to quit)
 * Restarting with stat
2017-06-17 16:34:15,473 - __main__ - INFO - START Flask
 * Debugger is active!
 * Debugger PIN: 122-568-160
2017-06-17 16:34:43,736 - __main__ - INFO - {'/api/help': 'Print available api routes', '/favicon.ico': 'The Favicon', '/': 'Home Page'}
127.0.0.1 - - [17/Jun/2017 16:34:43] "GET / HTTP/1.1" 200 -
```
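
For context, here is a minimal sketch of what an endpoint like /api/aggregate plausibly does, assuming it accepts a base64-encoded csv body plus column and group_by query parameters. It is an illustration consistent with the request/response examples in this README, not the project's actual web.py:

```python
# Minimal sketch of a Flask aggregation endpoint (assumed, not the real web.py).
import base64
from io import StringIO

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/aggregate", methods=["PUT"])
def aggregate():
    column = request.args.get("column")
    group_by = request.args.get("group_by")
    if not (column and group_by):
        # Mirrors the 400 body seen in the client simulation below
        return jsonify(column=column, group_by=group_by,
                       error_msg="Query Parameter column or group_by not set"), 400
    csv_text = base64.b64decode(request.data).decode("utf-8")
    df = pd.read_csv(StringIO(csv_text))
    # e.g. {"count":{"lee":3,"mcgregor":34,"norris":27}}
    return df.groupby(group_by)[[column]].sum().to_json()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001, debug=True)
```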

Test Client with Swagger UI

Next, open a web browser to view Swagger API documentation (formatted as HTML):

http://0.0.0.0:5001/apidocs/#/

For example, to see the Swagger docs/UI for the csv aggregate endpoint, go here:

http://0.0.0.0:5001/apidocs/#!/default/put_api_aggregate
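
The /apidocs route is the default UI path of the flasgger extension, so the Swagger docs were likely generated that way (an assumption; the project may use another tool). With flasgger, a YAML block after `---` in a view's docstring becomes that endpoint's Swagger spec:

```python
# Sketch of the flasgger pattern (assumed): docstring YAML drives the Swagger UI.
from flasgger import Swagger
from flask import Flask

app = Flask(__name__)
Swagger(app)  # serves interactive docs at /apidocs

@app.route("/api/aggregate", methods=["PUT"])
def aggregate():
    """Aggregate a base64-encoded csv.
    ---
    parameters:
      - name: column
        in: query
        type: string
      - name: group_by
        in: query
        type: string
    responses:
      200:
        description: The grouped aggregation of the csv
    """
    return "{}"
```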

Interactively Test application in IPython

Using the requests library you can query the api as follows in IPython:

```
In [1]: import requests, base64

In [2]: url = "http://0.0.0.0:5001/api/npsum"

In [3]: payload = {'column':'count', 'group_by':"last_name"}

In [4]: headers = {'Content-Type': 'application/json'}

In [5]: with open("ext/input.csv", "rb") as f:
   ...:     data = base64.b64encode(f.read())

In [6]: r = requests.put(url, data=data, params=payload, headers=headers)

In [7]: r.content
Out[7]: b'{"count":{"mcgregor":34,"lee":3,"norris":27}}'
```
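
Continuing the same session, the JSON body can be read straight back into Pandas to keep working with it; a short usage sketch (not part of the project):

```python
# Usage sketch: turn the API's JSON response into a DataFrame.
import pandas as pd

result = r.json()          # {'count': {'mcgregor': 34, 'lee': 3, 'norris': 27}}
df = pd.DataFrame(result)  # index is last_name, single 'count' column
print(df.sort_values("count", ascending=False))
#           count
# mcgregor     34
# norris       27
# lee           3
```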

How to simulate a client:

Run the client_simulation script:

```
(.pai-aws) tests git:(inperson-interview) python client_simulation.py
status code: 400
response body: {'column': 'count', 'error_msg': 'Query Parameter column or group_by not set', 'group_by': None}
status code: 200
response body: {'first_name': {'3': 'john', '10': 'chuck', '15': 'sam', '17': 'kristen', '19': 'john'}, 'last_name': {'3': 'lee', '10': 'norris', '15': 'mcgregor', '17': 'norris', '19': 'mcgregor'}}
```
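
The script itself is not listed here, but judging from the output it makes one request with a missing query parameter (expecting a 400) and one fully parameterized request (expecting a 200). A hypothetical sketch, with the endpoint and file paths assumed:

```python
# Hypothetical client simulation; endpoint and file paths are assumptions.
import base64
import requests

URL = "http://0.0.0.0:5001/api/aggregate"

with open("ext/input.csv", "rb") as f:
    data = base64.b64encode(f.read())

# Missing group_by: the API should reject this with a 400
r = requests.put(URL, data=data, params={"column": "count"})
print("status code:", r.status_code)
print("response body:", r.json())

# Both parameters set: expect a 200 with the aggregated result
r = requests.put(URL, data=data,
                 params={"column": "count", "group_by": "last_name"})
print("status code:", r.status_code)
print("response body:", r.json())
```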

How to interact with the Python library (nlib):

Typically I use IPython on the command line to test libraries that I create. Here is how to ensure the library is working (you should be able to import it and ingest a csv):

```
In [1]: from nlib import csvops

In [2]: df = csvops.ingest_csv("ext/input.csv")
2017-06-17 17:00:33,973 - nlib.csvops - INFO - CSV to DF conversion with CSV File Path ext/input.csv

In [3]: df.head()
Out[3]:
  first_name last_name  count
0      chuck    norris     10
1    kristen    norris     17
2       john       lee      3
3        sam  mcgregor     15
4       john  mcgregor     19
```
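
Judging from the INFO line, ingest_csv is a thin wrapper around pandas.read_csv plus logging; a minimal sketch of that pattern (not the project's exact code):

```python
# Sketch of an ingest_csv-style helper: pandas.read_csv plus a log line.
import logging

import pandas as pd

log = logging.getLogger(__name__)

def ingest_csv(csv_path):
    """Load a csv file into a pandas DataFrame, logging the conversion."""
    log.info("CSV to DF conversion with CSV File Path %s", csv_path)
    return pd.read_csv(csv_path)
```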

Benchmark Web Service

Finally, the simplest way to test everything is to use the Makefile to start the web service and then benchmark it (the benchmark uploads a base64-encoded csv):

```
(.pai-aws) pai-aws git:(master) make start-api
```

Then run the Apache benchmark via the Makefile. The output should look something like this:

```
(.pai-aws) pai-aws git:(inperson-interview) make benchmark-web
#very simple benchmark of api
ab -n 1000 -c 100 -T 'application/json' -u ext/input_base64.txt http://0.0.0.0:5001/api/aggregate\?column=count\&group_by=last_name
This is ApacheBench, Version 2.3 <$Revision: 1757674 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 0.0.0.0 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        Werkzeug/0.12.2
Server Hostname:        0.0.0.0
Server Port:            5001

Document Path:          /api/aggregate?column=count&group_by=last_name
Document Length:        154 bytes

Concurrency Level:      100
Time taken for tests:   7.657 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      309000 bytes
Total body sent:        308000
HTML transferred:       154000 bytes
Requests per second:    130.60 [#/sec] (mean)
Time per request:       765.716 [ms] (mean)
Time per request:       7.657 [ms] (mean, across all concurrent requests)
Transfer rate:          39.41 [Kbytes/sec] received
                        39.28 kb/s sent
                        78.69 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.1      0       6
Processing:    18  730 142.4    757     865
Waiting:       18  730 142.4    756     865
Total:         23  731 141.3    757     865

Percentage of the requests served within a certain time (ms)
  50%    757
  66%    777
  75%    787
  80%    794
  90%    830
  95%    850
  98%    860
  99%    862
 100%    865 (longest request)
```
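
Note how the latency numbers relate: at a concurrency level of 100, the mean time per request of 765.716 ms divided by 100 gives the 7.657 ms mean across all concurrent requests, and 1000 requests / 7.657 s ≈ 130.6 requests per second.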

Viewing Jupyter Notebooks

They can be found here:
https://github.com/noahgift/pai-aws/blob/inperson-interview/notebooks/api.ipynb

Circle CI Configuration

Circle CI is used to build the project. The configuration file looks as follows:

```
machine:
  python:
    version: 3.6.1
dependencies:
  pre:
    - make install
test:
  pre:
    - make lint-circleci
    - make test-circleci
```

The make commands being called are shown below. They write artifacts to the Circle CI artifacts directory:

```
lint-circleci:
	pylint --output-format=parseable --load-plugins pylint_flask --disable=R,C flask_app/*.py nlib csvcli > $$CIRCLE_ARTIFACTS/pylint.html

test-circleci:
	@cd tests; pytest -vv --cov-report html:$$CIRCLE_ARTIFACTS --cov=web --cov=nlib test_*.py
```
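
(The doubled $$CIRCLE_ARTIFACTS is Makefile escaping: make expands $$ to a single $, so the shell that runs the recipe sees the $CIRCLE_ARTIFACTS environment variable.)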

The URL for the project build is here: https://circleci.com/gh/noahgift/pai-aws. To see the pylint output and/or test coverage artifacts, you can go to the artifacts directory here (for build 24):

https://circleci.com/gh/noahgift/pai-aws/24#artifacts/containers/0