Project author: Dineshkarthik

Project description:
A curated list of Data Science and Engineering frameworks, tools, libraries and related list of tutorials.
Language:
Project URL: git://github.com/Dineshkarthik/awesome-data-science-and-engineering.git
Created: 2019-06-19T13:19:51Z
Project community: https://github.com/Dineshkarthik/awesome-data-science-and-engineering

License: MIT License


Data Science & Engineering

A curated list of Data Science and Engineering frameworks, tools, libraries and related list of tutorials.
It mostly covers Python-related open-source projects, ranging from beginner to intermediate level.

Big Data

PySpark - Apache Spark Python API - pypi

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Study Material

Frameworks

Apache Airflow - pypi

Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows.

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
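
To make the idea of dependency-ordered execution concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow's API: the task names and the `run_pipeline` helper are hypothetical, and a real scheduler would also handle retries, parallel workers, and scheduling intervals.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"extract"},
    "report": {"clean", "aggregate"},
}


def run_pipeline(deps):
    """Visit every task only after all of its upstream tasks have been visited."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real worker would execute the task here; the sketch only records it.
        executed.append(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

`clean` and `aggregate` do not depend on each other, so either may come first; the only guarantee is that `extract` precedes both and `report` comes last.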

Libraries

Pandas - pypi

Library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
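
As a small illustration of those data structures, the sketch below builds a `DataFrame` and computes a grouped mean. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: two temperature readings per city.
df = pd.DataFrame(
    {
        "city": ["Berlin", "Berlin", "Chennai", "Chennai"],
        "temp_c": [18.0, 22.0, 31.0, 33.0],
    }
)

# Group rows by city and average the temperature column; the result is a Series.
mean_by_city = df.groupby("city")["temp_c"].mean()
print(mean_by_city)
```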

NumPy - pypi

NumPy is the fundamental package for scientific computing with Python.
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
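
The point about arbitrary data types can be sketched with a structured dtype, which lets one array hold record-like rows. The field names and sample rows below are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical record layout: integer id, short Unicode name, float score.
record = np.dtype([("id", np.int32), ("name", "U10"), ("score", np.float64)])

rows = np.array(
    [(1, "ada", 91.5), (2, "linus", 78.0), (3, "grace", 88.25)],
    dtype=record,
)

# Fields are addressed by name, and boolean masks work as on plain arrays.
high_scorers = rows[rows["score"] > 80]["name"]
mean_score = rows["score"].mean()
print(high_scorers, mean_score)
```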

Alembic - pypi

Alembic is a database migrations tool written by the author of SQLAlchemy. A migrations tool offers the following functionality:

  1. Can emit ALTER statements to a database in order to change the structure of tables and other constructs.
  2. Provides a system whereby “migration scripts” may be constructed; each script indicates a particular series of steps that can “upgrade” a target database to a new version, and optionally a series of steps that can “downgrade” similarly, doing the same steps in reverse.
  3. Allows the scripts to execute in some sequential manner.
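
The three capabilities listed above can be sketched without Alembic itself. The example below uses only the standard library's `sqlite3` module; the `MIGRATIONS` list, the `migrate` helper, and the use of `PRAGMA user_version` to track the schema version are assumptions made for illustration and are not Alembic's API.

```python
import sqlite3

# Hypothetical migration scripts: (version, upgrade SQL, downgrade SQL).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)", "DROP TABLE users"),
    (2, "ALTER TABLE users RENAME TO accounts", "ALTER TABLE accounts RENAME TO users"),
]


def current_version(conn):
    """Read the schema version stored in the database file header."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn, target):
    """Apply upgrade or downgrade steps one at a time until `target` is reached."""
    version = current_version(conn)
    if target > version:
        # Upgrades run in ascending version order.
        for v, up, _ in MIGRATIONS:
            if version < v <= target:
                conn.execute(up)
                conn.execute(f"PRAGMA user_version = {v}")
    else:
        # Downgrades undo the same steps in reverse order.
        for v, _, down in reversed(MIGRATIONS):
            if target < v <= version:
                conn.execute(down)
                conn.execute(f"PRAGMA user_version = {v - 1}")
    conn.commit()


def table_names(conn):
    query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
migrate(conn, 2)
tables_after_upgrade = table_names(conn)
migrate(conn, 1)
tables_after_downgrade = table_names(conn)
print(tables_after_upgrade, tables_after_downgrade)
```

Alembic keeps its own revision identifiers and generates the SQL through SQLAlchemy; the integer versions here only mirror the upgrade and downgrade flow.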

Tools

JupyterLab

JupyterLab is the next-generation web-based user interface for Project Jupyter.

JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner. You can arrange multiple documents and activities side by side in the work area using tabs and splitters. Documents and activities integrate with each other, enabling new workflows for interactive computing.

Google Colab

Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.