Author: justinhchae

Description:
A helpful package to streamline Pandas DataFrame optimization.
Language: Python
Repository: git://github.com/justinhchae/pd-helper.git
Created: 2021-04-07T21:36:35Z
Project page: https://github.com/justinhchae/pd-helper

License: MIT License


pd-helper

A helpful package to streamline Pandas DataFrame optimization.

Save 50-75% on DataFrame memory usage by running the optimizer.

Automatically assign an appropriate dtype to each column with helper.

Generate a random DataFrame of controlled random variables for testing with maker.

Install

    pip install pd-helper

Basic Usage to Iterate over DataFrame

    from pd_helper.maker import MakeData
    from pd_helper.helper import optimize

    faker = MakeData()

    if __name__ == "__main__":
        # MakeData() generates a fake DataFrame, convenient for testing
        df = faker.make_df()
        df = optimize(df)
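To give a sense of where savings of this kind can come from, here is a minimal sketch in plain pandas. It is not pd-helper's actual ruleset: the function name `downcast_frame` and the `category_ratio` threshold are hypothetical, chosen only for illustration. It shrinks integer columns to the smallest integer type that holds their values and turns low-cardinality text columns into categoricals.

```python
import numpy as np
import pandas as pd


def downcast_frame(df: pd.DataFrame, category_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes for integer and text columns.

    Hypothetical helper for illustration; not part of pd-helper.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Pick the smallest integer dtype that can hold every value.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_string_dtype(series):
            # Text with few distinct values is usually cheaper as a categorical.
            ratio = series.nunique(dropna=True) / max(len(series), 1)
            if ratio < category_ratio:
                out[col] = series.astype("category")
    return out


# Small made-up frame: one integer column and one repetitive text column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sector": rng.integers(1, 5, size=10_000),
    "item_name": rng.choice(["Toaster", "Lighter"], size=10_000),
})
small = downcast_frame(df)
print(small.dtypes)
```

Float columns are left alone here because moving from float64 to float32 can lose precision, which is a decision better made per column.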

Better Usage With Multiprocessing

    from pd_helper.maker import MakeData
    from pd_helper.helper import optimize

    faker = MakeData()

    if __name__ == "__main__":
        # MakeData() generates a fake DataFrame, convenient for testing
        df = faker.make_df()
        df = optimize(df, enable_mp=True)

Specify Special Mappings

    from pd_helper.maker import MakeData
    from pd_helper.helper import optimize

    faker = MakeData()

    if __name__ == "__main__":
        # MakeData() generates a fake DataFrame, convenient for testing
        df = faker.make_df()
        special_mappings = {'string': ['object_id'],
                            'category': ['item_name']}
        # Columns listed in special_mappings are cast to the given dtype
        # instead of the dtype the optimize ruleset would choose.
        df = optimize(df,
                      enable_mp=True,
                      special_mappings=special_mappings)
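The mapping above has the shape `{dtype: [columns]}`. As a rough illustration of what such a mapping amounts to, the sketch below applies it with plain pandas by inverting it into the `{column: dtype}` form that `DataFrame.astype` accepts. The helper name `apply_special_mappings` is hypothetical, and whether pd-helper does this internally is an assumption.

```python
import pandas as pd

special_mappings = {"string": ["object_id"],
                    "category": ["item_name"]}


def apply_special_mappings(df: pd.DataFrame, mappings: dict) -> pd.DataFrame:
    """Cast the listed columns to the requested dtypes (hypothetical helper)."""
    # Invert {dtype: [columns]} into {column: dtype}, skipping absent columns.
    col_to_dtype = {
        col: dtype
        for dtype, cols in mappings.items()
        for col in cols
        if col in df.columns
    }
    return df.astype(col_to_dtype)


# Tiny made-up frame using the column names from the mapping above.
df = pd.DataFrame({
    "object_id": ["a1", "b2", "c3"],
    "item_name": ["Toaster", "Lighter", "Toaster"],
    "sector": [1, 2, 1],
})
typed = apply_special_mappings(df, special_mappings)
print(typed.dtypes)
```

Columns not named in the mapping, such as `sector` here, keep their original dtype.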

Sample Results with Helper

    Starting with 175.63 MB memory.
    After optimization.
    Ending with 65.33 MB memory.
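Numbers like these can be checked on any DataFrame with pandas' own `memory_usage`. The sketch below is an assumption about how such a figure might be computed, not pd-helper's logging code: the helper names are hypothetical, and it takes 1 MB to be 1024 * 1024 bytes. It also works out the relative saving for the sample run above.

```python
import pandas as pd


def memory_mb(df: pd.DataFrame) -> float:
    """Total memory of df in MB, including the contents of text columns."""
    # deep=True inspects the Python objects held in object columns.
    return df.memory_usage(deep=True).sum() / 1024 ** 2


def reduction_pct(before_mb: float, after_mb: float) -> float:
    """Percentage of memory saved going from before_mb to after_mb."""
    return 100 * (before_mb - after_mb) / before_mb


# Relative saving for the sample run: 175.63 MB down to 65.33 MB.
print(f"{reduction_pct(175.63, 65.33):.1f}% saved")  # prints "62.8% saved"
```

A saving of about 62.8% sits inside the 50-75% range quoted at the top of this README.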

Generating a Randomly Imperfect DataFrame with Maker

Maker provides a class, MakeData(), to generate a table of made-up records.

Each row is an event where an item was retrieved.

Options are available to make the table imperfect in various random ways.

Sample table below:

               Retrieved Date          Item Name          Retrieved     Condition         Sector
    Example    2019-01-01, 2019-03-4   Toaster, Lighter   True, False   Junk, Excellent   1, 2
    Data Type  String                  String             String        String            Integer
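For readers who want a comparable table without installing the package, here is a minimal sketch that builds one with NumPy and pandas. It is not MakeData's implementation: the function `make_records`, its parameters, and the choice to blank out values in the Condition column are all hypothetical, used only to show one way of making a table "imperfect". The column names and example values follow the sample table above.

```python
import numpy as np
import pandas as pd


def make_records(n_rows: int = 1000, missing_rate: float = 0.05,
                 seed: int = 0) -> pd.DataFrame:
    """Build a made-up retrieval table (hypothetical stand-in for MakeData)."""
    rng = np.random.default_rng(seed)
    # Dates are kept as strings, matching the data types in the sample table.
    dates = list(pd.date_range("2019-01-01", periods=90, freq="D")
                 .strftime("%Y-%m-%d"))
    df = pd.DataFrame({
        "Retrieved Date": rng.choice(dates, size=n_rows),
        "Item Name": rng.choice(["Toaster", "Lighter"], size=n_rows),
        "Retrieved": rng.choice(["True", "False"], size=n_rows),
        "Condition": rng.choice(["Junk", "Excellent"], size=n_rows),
        "Sector": rng.integers(1, 3, size=n_rows),  # values 1 or 2
    })
    # One kind of imperfection: blank out a random share of Condition values.
    mask = rng.random(n_rows) < missing_rate
    df.loc[mask, "Condition"] = None
    return df


records = make_records(n_rows=500, missing_rate=0.1, seed=1)
print(records.head())
```

Fixing the seed makes the table reproducible, which helps when comparing memory usage before and after an optimization pass.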

TODO

  • Improve efficiency of iterating on DataFrame.

  • Allow user to toggle logging.

  • Provide tools for imputing missing data.