DSCI 524 Group 8 Collaborative Software Development Project
DRY out your regression analysis!
As data scientists, we rely on exploratory data analysis (EDA) and regression analysis to understand trends in data. Following the DRY (Don't Repeat Yourself) principle is also a major priority for maximizing code quality. Yet data scientists facing these tasks often start the entire process from scratch, wasting time and effort while compromising code quality. The aridanalysis package strives to remedy this problem by giving users an easy-to-implement EDA function alongside 3 robust statistical tests that simplify these analytical processes and produce an easy-to-read interpretation of the input data. Users will no longer have to write many lines of code to explore their data effectively.
arid_eda
This function takes in the data frame of interest and generates summary statistics as well as basic exploratory data analysis plots to help users understand the overall behaviour of the explanatory and response variables.
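To give a feel for the kind of summary such a function produces, here is a minimal sketch in plain pandas. It is not the package's implementation: the toy data frame and its column names are made up for illustration, and the plotting side is left out.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "rooms": [2, 3, 3, 4, 5],
        "age": [30, 12, 25, 8, 3],
        "price": [150.0, 210.0, 190.0, 280.0, 340.0],
    }
)

# Summary statistics for every numeric column (count, mean, std, quartiles).
summary = df.describe()

# Pairwise Pearson correlations between the features and the response.
correlations = df.corr()

print(summary.loc[["mean", "std"]])
print(correlations["price"])
```

Writing these few lines for every new data set is the kind of repetition a single EDA call is meant to remove.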
arid_linreg
This function takes in the data frame of interest and performs a linear regression with the given regularization and features. The function then outputs an sklearn regression model for prediction and an equivalent statsmodels regression model to provide inference.
arid_logreg
This function takes in a data frame and performs either binomial or multinomial classification based on user inputs. The function then outputs an sklearn logistic regression model for prediction and an equivalent statsmodels logit regression model to provide inference.
arid_countreg
This function takes a data frame, its categorical and continuous variables, and other user inputs to perform a Poisson regression. The function returns an sklearn Poisson regressor model for prediction and a statsmodels wrapper for inference purposes.
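For readers unfamiliar with the pieces involved, the following sketch shows the prediction half of such a workflow using pandas and scikit-learn directly. It assumes an additive model (no interaction terms), uses made-up data, and is not taken from the aridanalysis source.

```python
import pandas as pd
from sklearn.linear_model import PoissonRegressor

# Hypothetical toy data with one categorical and one continuous feature.
df = pd.DataFrame(
    {
        "x1": ["bad", "good", "bad", "good", "bad", "good"],
        "x2": [34.56, 34.21, 19.57, 22.10, 28.40, 30.05],
        "y": [6, 8, 14, 11, 9, 7],
    }
)

# One-hot encode the categorical feature; the continuous one is kept as is.
X = pd.get_dummies(df[["x1", "x2"]], columns=["x1"], drop_first=True, dtype=float)

# alpha is the L2 penalty strength in scikit-learn's PoissonRegressor.
model = PoissonRegressor(alpha=1.0).fit(X, df["y"])
predicted_counts = model.predict(X)

print(X.columns.tolist())
print(predicted_counts)
```

Because the Poisson model uses a log link, the predicted counts are always positive, whatever the feature values.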
import aridanalysis as aa
import pandas as pd
from vega_datasets import data
# house_prices is assumed to be a pandas DataFrame you have already loaded
>>> dataframe, plots = aa.arid_eda(house_prices,
...                                'price',
...                                'continuous',
...                                ['rooms', 'age', 'garage'])
iris_data = data.iris()  # iris data set shipped with vega_datasets
>>> dataframe, plots = aa.arid_eda(iris_data,
...                                'species',
...                                'categorical',
...                                ['petalWidth', 'sepalWidth', 'petalLength'])
tdf = pd.DataFrame(
    {
        "x1": [1, 0, 0],
        "x2": [0, 1.0, 0],
        "x3": [0, 0, 1],
        "x4": ["a", "a", "b"],
        "y": [1, 3, -1.0],
    }
)
# The response is passed as a column name (string)
>>> aa.arid_linreg(tdf, "y")
# Named rows rather than data to avoid shadowing the vega_datasets import
rows = [
    [32, "male", 80, 0],
    [26, "female", 65, 1],
    [22, "female", 75, 1],
    [36, "male", 85, 0],
    [45, "male", 82, 1],
    [18, "female", 57, 0],
    [57, "male", 60, 1],
]
df = pd.DataFrame(rows, columns=["x1", "x2", "x3", "y"])
# The response is passed as a column name (string)
>>> aa.arid_logreg(df, "y")
df = pd.DataFrame(
    {
        "x1": ["bad", "good", "bad"],
        "x2": [34.56, 34.21, 19.57],
        "y": [6, 8, 14],
    }
)
# The response and feature columns are passed as column names (strings)
>>> aa.arid_countreg(df, "y", con_features=["x2"], cat_features=["x1"], model="additive", alpha=1)
This package builds on the EDA and statistical analysis provided by the pandas, scikit-learn and statsmodels Python packages to streamline data visualization and model analysis functionality. Some existing packages already help with these tasks; the aridanalysis package aims to ease the job of going through pandas-profiling and to provide different regression analysis interpretations.
$ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple aridanalysis
The official documentation is hosted on Read the Docs: https://aridanalysis.readthedocs.io/en/latest/
Group 8 Members:
Craig McLaughlin : @cmmclaug
Daniel Ortiz Nunez : @danielon-5
Neel Phaterpekar : @nphaterp
Santiago Rugeles Schoonewolff : @ansarusc
We welcome and recognize all contributions. You can see a list of all current contributors in the contributors tab.
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.