pd-helper

A helpful package to streamline Pandas DataFrame optimization.

Save 50-75% on DataFrame memory usage by running the optimizer.

Autoconfigure dtypes for appropriate data types in each column with helper.

Generate a random DataFrame of controlled random variables for testing with maker.

Install

 pip install pd-helper

Basic Usage to Iterate over DataFrame

from pd_helper.maker import MakeData 
from pd_helper.helper import optimize
faker = MakeData()
if __name__ == "__main__":
   # MakeData() generates a fake dataframe, convenient for testing
   df = faker.make_df()
   df = optimize(df)

Better Usage With Multiprocessing

from pd_helper.maker import MakeData 
from pd_helper.helper import optimize
faker = MakeData()
if __name__ == "__main__":
   # MakeData() generates a fake dataframe, convenient for testing
   df = faker.make_df()
   df = optimize(df, enable_mp=True)

Specify Special Mappings

from pd_helper.maker import MakeData 
from pd_helper.helper import optimize
faker = MakeData()
if __name__ == "__main__":
   # MakeData() generates a fake dataframe, convenient for testing
   df = faker.make_df()
   special_mappings = {'string': ['object_id'],
                       'category': ['item_name']}
   # special mappings will be applied instead of by optimize ruleset, they will be returned.
   df = optimize(df
                 , enable_mp=True,
                 special_mappings=special_mappings
                 )

Sample Results with Helper

Starting with 175.63 MB memory.
After optmization. 
Ending with 65.33 MB memory.

Generating a Randomly Imperfect DataFrame with Maker

Maker provides a class, MakeData(), to generate a table of made-up records.

Each row is an event where an item was retrieved.

Options to make the table imperfectly random in various ways.

Sample table below:

	Retrieved Date	Item Name	Retrieved	Condition	Sector
Example	2019-01-01, 2019-03-4	Toaster, Lighter	True, False	Junk, Excellent	1, 2
Data Type	String	String	String	String	Integer

References

Pandas Categorical: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html
Pandas Pickle: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html
Pandas CSV: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Pandas Datetime: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

TODO

Improve efficiency of iterating on DataFrame.
Allow user to toggle logging.
Provide tools for imputing missing data.