Project author: mouradmourafiq

Project description:
An extension to pandas dataframes describe function.
Language: Python
Repository: git://github.com/mouradmourafiq/pandas-summary.git
Created: 2016-03-25T21:59:32Z
Project community: https://github.com/mouradmourafiq/pandas-summary

License: MIT License


License: Apache 2

TraceML

Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

Install

    pip install traceml

If you would like to use the tracking features, you need to install polyaxon as well:

    pip install polyaxon traceml

[WIP] Local sandbox

Coming soon

Offline usage

You can enable the offline mode to track runs without an API:

    export POLYAXON_OFFLINE="true"

Or pass the offline flag:

    from traceml import tracking

    tracking.init(..., is_offline=True, ...)
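The environment variable can also be set from Python before tracking is initialised. The sketch below uses only the standard library. It assumes, as the shell example suggests, that the value is read from the process environment when tracking starts; `enable_offline_mode` is a hypothetical helper name and not part of TraceML.

```python
import os


def enable_offline_mode() -> None:
    """Set POLYAXON_OFFLINE for the current process (hypothetical helper)."""
    # Same effect as `export POLYAXON_OFFLINE="true"`, but only for this
    # process and any child processes it starts afterwards.
    os.environ["POLYAXON_OFFLINE"] = "true"


enable_offline_mode()
print(os.environ["POLYAXON_OFFLINE"])
```

Call the helper before `tracking.init(...)` so the setting is in place when the run is created.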

Simple usage in a Python script

    import random

    from traceml import tracking

    tracking.init(
        is_offline=True,
        project='quick-start',
        name="my-new-run",
        description="trying TraceML",
        tags=["examples"],
        artifacts_path="path/to/artifacts/repo"
    )

    # Tracking some data refs (X_train and y_train are your own training data)
    tracking.log_data_ref(content=X_train, name='x_train')
    tracking.log_data_ref(content=y_train, name='y_train')

    # Tracking inputs
    tracking.log_inputs(
        batch_size=64,
        dropout=0.2,
        learning_rate=0.001,
        optimizer="Adam"
    )


    def get_loss(step):
        # Synthetic loss that decays with the step, plus some random noise
        result = 10 / (step + 1)
        noise = (random.random() - 0.5) * 0.5 * result
        return result + noise


    # Track metrics
    for step in range(100):
        loss = get_loss(step)
        tracking.log_metrics(
            loss=loss,
            accuracy=(100 - loss) / 100.0,
        )

    # Track some one time results
    tracking.log_outputs(validation_score=0.66)

    # Optionally manually stop the tracking process
    tracking.stop()
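To preview the values this loop would pass to `log_metrics` without installing anything, the synthetic loss can be run on its own. This is a minimal sketch: `collect_metrics` and the fixed seed are illustrative additions, and the returned list merely stands in for the tracking backend.

```python
import random


def get_loss(step: int) -> float:
    """Synthetic loss: decays as 10 / (step + 1) with up to +/-25% noise."""
    result = 10 / (step + 1)
    noise = (random.random() - 0.5) * 0.5 * result
    return result + noise


def collect_metrics(steps: int, seed: int = 0) -> list:
    """Return the per-step metrics instead of logging them (hypothetical helper)."""
    random.seed(seed)  # fixed seed so repeated runs produce the same values
    history = []
    for step in range(steps):
        loss = get_loss(step)
        history.append(
            {"step": step, "loss": loss, "accuracy": (100 - loss) / 100.0}
        )
    return history


history = collect_metrics(100)
print(len(history), history[0], history[-1])
```

Because the noise is proportional to the base value, each logged loss stays within 25% of `10 / (step + 1)`, so the derived accuracy rises towards 1 as the steps progress.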

Integration with deep learning and machine learning libraries and frameworks

Keras

You can use TraceML's callback to automatically save all metrics and collect outputs and models. You can also track additional information using the logging methods:

    from traceml import tracking
    from traceml.integrations.keras import Callback

    tracking.init(
        is_offline=True,
        project='tracking-project',
        name="keras-run",
        description="trying TraceML & Keras",
        tags=["examples"],
        artifacts_path="path/to/artifacts/repo"
    )

    tracking.log_inputs(
        batch_size=64,
        dropout=0.2,
        learning_rate=0.001,
        optimizer="Adam"
    )
    tracking.log_data_ref(content=x_train, name='x_train')
    tracking.log_data_ref(content=y_train, name='y_train')
    tracking.log_data_ref(content=x_test, name='x_test')
    tracking.log_data_ref(content=y_test, name='y_test')

    # ...

    model.fit(
        x_train,
        y_train,
        validation_data=(x_test, y_test),
        epochs=epochs,
        batch_size=100,
        callbacks=[Callback()],
    )

PyTorch

You can log metrics, inputs, and outputs of PyTorch experiments using the tracking module:

    import torch
    import torch.nn.functional as F

    from traceml import tracking

    tracking.init(
        is_offline=True,
        project='tracking-project',
        name="pytorch-run",
        description="trying TraceML & PyTorch",
        tags=["examples"],
        artifacts_path="path/to/artifacts/repo"
    )

    tracking.log_inputs(
        batch_size=64,
        dropout=0.2,
        learning_rate=0.001,
        optimizer="Adam"
    )

    # Metrics
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()  # reset gradients left over from the previous batch
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        tracking.log_metrics(loss=loss.item())  # log a Python float, not a tensor

    asset_path = tracking.get_outputs_path('model.ckpt')
    torch.save(model.state_dict(), asset_path)

    # log model
    tracking.log_artifact_ref(asset_path, framework="pytorch", ...)

TensorFlow

You can log metrics, outputs, and models of TensorFlow experiments and distributed TensorFlow experiments using the tracking module:

    from traceml import tracking
    from traceml.integrations.tensorflow import Callback

    tracking.init(
        is_offline=True,
        project='tracking-project',
        name="tf-run",
        description="trying TraceML & Tensorflow",
        tags=["examples"],
        artifacts_path="path/to/artifacts/repo"
    )

    tracking.log_inputs(
        batch_size=64,
        dropout=0.2,
        learning_rate=0.001,
        optimizer="Adam"
    )

    # log model
    estimator.train(hooks=[Callback(log_image=True, log_histo=True, log_tensor=True)])

Fastai

You can log metrics, outputs, and models of Fastai experiments using the tracking module:

    from traceml import tracking
    from traceml.integrations.fastai import Callback

    tracking.init(
        is_offline=True,
        project='tracking-project',
        name="fastai-run",
        description="trying TraceML & Fastai",
        tags=["examples"],
        artifacts_path="path/to/artifacts/repo"
    )

    # Log model metrics
    learn.fit(..., cbs=[Callback()])

PyTorch Lightning

You can log metrics, outputs, and models of PyTorch Lightning experiments using the tracking module:

    import pytorch_lightning as pl

    from traceml import tracking
    from traceml.integrations.pytorch_lightning import Callback

    tracking.init(
        is_offline=True,
        project='tracking-project',
        name="pytorch-lightning-run",
        description="trying TraceML & Lightning",
        tags=["examples"],
        artifacts_path="path/to/artifacts/repo"
    )

    # ...

    trainer = pl.Trainer(
        gpus=0,
        progress_bar_refresh_rate=20,
        max_epochs=2,
        logger=Callback(),
    )

HuggingFace

You can log metrics, outputs, and models of HuggingFace experiments using the tracking module:

    from transformers import Trainer

    from traceml import tracking
    from traceml.integrations.hugging_face import Callback

    tracking.init(
        is_offline=True,
        project='tracking-project',
        name="hg-run",
        description="trying TraceML & HuggingFace",
        tags=["examples"],
        artifacts_path="path/to/artifacts/repo"
    )

    # ...

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        callbacks=[Callback],
        # ...
    )

Tracking artifacts

    import altair as alt
    import matplotlib.pyplot as plt
    import numpy as np
    import plotly.express as px
    from bokeh.plotting import figure
    from vega_datasets import data

    from traceml import tracking


    def plot_mpl_figure(step):
        np.random.seed(19680801)
        # Named `samples` and `fig` so they do not shadow the `data` and `figure` imports
        samples = np.random.randn(2, 100)

        fig, axs = plt.subplots(2, 2, figsize=(5, 5))
        axs[0, 0].hist(samples[0])
        axs[1, 0].scatter(samples[0], samples[1])
        axs[0, 1].plot(samples[0], samples[1])
        axs[1, 1].hist2d(samples[0], samples[1])

        tracking.log_mpl_image(fig, 'mpl_image', step=step)


    def log_bokeh(step):
        factors = ["a", "b", "c", "d", "e", "f", "g", "h"]
        x = [50, 40, 65, 10, 25, 37, 80, 60]

        dot = figure(title="Categorical Dot Plot", tools="", toolbar_location=None,
                     y_range=factors, x_range=[0, 100])
        dot.segment(0, factors, x, factors, line_width=2, line_color="green", )
        dot.circle(x, factors, size=15, fill_color="orange", line_color="green", line_width=3, )

        factors = ["foo 123", "bar:0.2", "baz-10"]
        x = ["foo 123", "foo 123", "foo 123", "bar:0.2", "bar:0.2", "bar:0.2", "baz-10", "baz-10",
             "baz-10"]
        y = ["foo 123", "bar:0.2", "baz-10", "foo 123", "bar:0.2", "baz-10", "foo 123", "bar:0.2",
             "baz-10"]
        colors = [
            "#0B486B", "#79BD9A", "#CFF09E",
            "#79BD9A", "#0B486B", "#79BD9A",
            "#CFF09E", "#79BD9A", "#0B486B"
        ]

        hm = figure(title="Categorical Heatmap", tools="hover", toolbar_location=None,
                    x_range=factors, y_range=factors)
        hm.rect(x, y, color=colors, width=1, height=1)

        tracking.log_bokeh_chart(name='confusion-bokeh', figure=hm, step=step)


    def log_altair(step):
        source = data.cars()

        brush = alt.selection(type='interval')

        points = alt.Chart(source).mark_point().encode(
            x='Horsepower:Q',
            y='Miles_per_Gallon:Q',
            color=alt.condition(brush, 'Origin:N', alt.value('lightgray'))
        ).add_selection(
            brush
        )

        bars = alt.Chart(source).mark_bar().encode(
            y='Origin:N',
            color='Origin:N',
            x='count(Origin):Q'
        ).transform_filter(
            brush
        )

        chart = points & bars

        tracking.log_altair_chart(name='altair_chart', figure=chart, step=step)


    def log_plotly(step):
        df = px.data.tips()

        fig = px.density_heatmap(df, x="total_bill", y="tip", facet_row="sex", facet_col="smoker")
        tracking.log_plotly_chart(name="2d-hist", figure=fig, step=step)


    plot_mpl_figure(100)
    log_bokeh(100)
    log_altair(100)
    log_plotly(100)

Tracking DataFrames

Summary

An extension to the pandas DataFrame describe() function.

The module contains a DataFrameSummary object that extends describe() with:

  • properties
    • dfs.columns_stats: counts, uniques, missing, missing_perc, and type per column
    • dfs.columns_types: a count of the types of columns
    • dfs[column]: a more in-depth summary of the column
  • function
    • summary(): extends the output of describe() with the values from columns_stats

DataFrameSummary expects a pandas DataFrame to summarise.

    from traceml.summary.df import DataFrameSummary

    dfs = DataFrameSummary(df)

Getting the column types:

    dfs.columns_types

    numeric        9
    bool           3
    categorical    2
    unique         1
    date           1
    constant       1
    dtype: int64

Getting the column stats:

    dfs.columns_stats

                       A            B        C        D        E
    counts          5802         5794     5781     5781     4617
    uniques         5802            3     5771      128      121
    missing            0            8       21       21     1185
    missing_perc      0%        0.14%    0.36%    0.36%   20.42%
    types         unique  categorical  numeric  numeric  numeric
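The first four rows of `columns_stats` can be read as simple per-column counts. The sketch below reproduces them in plain Python for a list of values in which `None` marks a missing entry. `column_stats` is a hypothetical helper written for illustration, not the library's implementation. The placeholder values `"x"`, `"y"`, `"z"` are chosen only to match the counts shown for column B, and the percentage formatting is an assumption based on the output above.

```python
def column_stats(values: list) -> dict:
    """Compute counts, uniques, missing and missing_perc for one column.

    `None` marks a missing value (an assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    # Share of missing entries among all rows, as a percentage.
    missing_perc = 100.0 * missing / len(values) if values else 0.0
    return {
        "counts": len(present),
        "uniques": len(set(present)),
        "missing": missing,
        "missing_perc": f"{missing_perc:.2f}%",
    }


# Shaped like column B above: 5794 present values, 3 distinct, 8 missing.
column_b = ["x", "y", "z"] * 1931 + ["x"] + [None] * 8
print(column_stats(column_b))
```

Note that `missing_perc` is taken over all rows (present plus missing), which is why 8 missing values out of 5802 rows appear as 0.14%.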

Getting a single column summary, e.g. for a numerical column:

    # we can also access the column by its integer position instead of its name
    dfs['A']

    std                                                    0.2827146
    max                                                     1.072792
    min                                                            0
    variance                                              0.07992753
    mean                                                   0.5548516
    5%                                                     0.1603367
    25%                                                    0.3199776
    50%                                                    0.4968588
    75%                                                    0.8274732
    95%                                                     1.011255
    iqr                                                    0.5074956
    kurtosis                                               -1.208469
    skewness                                               0.2679559
    sum                                                     3207.597
    mad                                                    0.2459508
    cv                                                     0.5095319
    zeros_num                                                     11
    zeros_perc                                                  0,1%
    deviating_of_mean                                             21
    deviating_of_mean_perc                                     0.36%
    deviating_of_median                                           21
    deviating_of_median_perc                                   0.36%
    top_correlations          {u'D': 0.702240243124, u'E': -0.663}
    counts                                                      5781
    uniques                                                     5771
    missing                                                       21
    missing_perc                                               0.36%
    types                                                    numeric
    Name: A, dtype: object
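Several rows in this summary are derived from others, which gives a quick way to sanity-check the output. The sketch below recomputes three of them from the printed values, assuming the usual definitions: variance is the square of std, iqr is the 75% quantile minus the 25% quantile, and cv is std divided by mean. The source does not state these formulas, but the printed numbers agree with them to the precision shown.

```python
# Values copied from the dfs['A'] output above.
std, mean = 0.2827146, 0.5548516
q25, q75 = 0.3199776, 0.8274732

variance = std ** 2   # summary row "variance": 0.07992753
iqr = q75 - q25       # summary row "iqr":      0.5074956
cv = std / mean       # summary row "cv":       0.5095319

print(f"variance={variance:.7f} iqr={iqr:.7f} cv={cv:.7f}")
```

The same kind of check can be applied to any summary where a derived statistic looks suspicious.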

[WIP] Summaries

  • Add summary analysis between columns, i.e. dfs[[1, 2]]

[WIP] Visualizations

  • Add summary visualization with matplotlib.
  • Add summary visualization with plotly.
  • Add summary visualization with altair.
  • Add predefined profiling.

[WIP] Catalog and Versions

  • Add possibility to persist summary and link to a specific version.
  • Integrate with quality libraries.