http://insideairbnb.com/
Use pipx to install cookiecutter
$ pipx install cookiecutter
We are going to base our project on the Data Science cookiecutter
$ cookiecutter https://github.com/drivendata/cookiecutter-data-science
project_name [project_name]: price_forecaster
repo_name [price_forecaster]:
author_name [Your name (or your organization/company/team)]: Anders Bogsnes
description [A short description of the project.]: Predicting Airbnb Prices
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: 1
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
We restructure the project to use the src/<package_name> layout (https://hynek.me/articles/testing-packaging/#src)
We add a tests directory to put our tests in
The rest is up to you to decide as we work through the project
$ cd price_forecaster/src
$ mkdir price_forecaster
$ mv * price_forecaster
# Ignore the error about moving price_forecaster into itself - everything else was moved
# Get back to our project root
$ cd ..
$ pyenv install 3.8.5
$ pyenv virtualenv 3.8.5 price_forecaster
$ pyenv local price_forecaster
$ poetry init
Package name [price_forecaster]:
Version [0.1.0]:
Description []: Airbnb price forecaster
Author [Anders Bogsnes <andersbogsnes@gmail.com>, n to skip]:
License []: MIT
Compatible Python versions [^3.8]:
Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no
Generated file
[tool.poetry]
name = "price_forecaster"
version = "0.1.0"
description = "Airbnb price forecaster"
authors = ["Anders Bogsnes <andersbogsnes@gmail.com>"]
license = "MIT"
[tool.poetry.dependencies]
python = "^3.8"
[tool.poetry.dev-dependencies]
[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"
Do you confirm generation? (yes/no) [yes]
Poetry is a relatively new dependency manager which takes advantage of the new pyproject.toml file (PEP 518 & PEP 517)
Poetry differentiates between dev-dependencies and runtime-dependencies
$ poetry add pandas
$ poetry add --dev pytest
Adding dependencies updates pyproject.toml and poetry.lock
Both pyproject.toml and poetry.lock should be checked into version control
poetry install will install the dependencies from your pyproject.toml and install your package in edit mode (-e .) automatically

We want to be able to reference our data folders - so let’s have a config file to import those paths from
import pathlib
# parents[2] depends on where this config file lives - here we assume src/price_forecaster/config.py
ROOT_DIR = pathlib.Path(__file__).parents[2]
DATA_DIR = ROOT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
INTERIM_DIR = DATA_DIR / "interim"
PROCESSED_DIR = DATA_DIR / "processed"
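
As a sketch of how these paths get used (the listings.csv filename and the cleaning step are placeholders, and we assume the config module lives at src/price_forecaster/config.py):

import pandas as pd

from price_forecaster.config import PROCESSED_DIR, RAW_DIR

# Read the raw download without hardcoding absolute paths
listings = pd.read_csv(RAW_DIR / "listings.csv")

# ... cleaning / feature engineering would happen here ...

# Write pipeline outputs next to the other processed data
listings.to_csv(PROCESSED_DIR / "listings_clean.csv", index=False)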
Sphinx is a documentation compiler - it takes your files and documents and compiles them into a target format - usually HTML
The cookiecutter comes with a basic sphinx setup in the docs folder.
Running make html compiles the .rst files in the docs folder to HTML - open ./_build/html/index.html to see the result.
We can configure sphinx in the conf.py file in docs.
We can change the theme by finding the html_theme
variable and setting it to the theme we want.
Sphinx uses reStructuredText as its markup language.
Read the basics here:
https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html
Sphinx and .rst have many directives that let sphinx know you want to do something.
A directive looks like this:
.. my_directive::
   :arg1: 1
   :arg2: 2

   Content given to the directive
The first directive we will be looking at is toctree
toctree defines what .rst files are included in the build and should be present somewhere in index.rst
It has a number of options
Generally, we list the names of the files we want to include as its content, e.g. features.rst
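
A minimal toctree for index.rst could look like the sketch below (features is an assumed document name, listed without the .rst extension):

.. toctree::
   :maxdepth: 2

   features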
We can refer to other sections or definitions using the appropriate role
If we have a label somewhere
.. _my_label:
We can create a link to it
This refers to :ref:`my_label`
(note the backticks)
If we have documented a class, for example, we can link to it:
:class:`Foo`
Sphinx has a number of extensions we can use - some built-in and some we can install
These are configured in the conf.py
file and there are a number of useful ones we should set up
The autodoc extension extracts documentation from your code's docstrings
This will import the documentation for the my_package.my_module module:

.. automodule:: my_package.my_module
   :members:
We recommend writing docstrings in the “Numpy” docstyle
https://numpydoc.readthedocs.io/en/latest/format.html
This extension lets sphinx parse this style
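
As a sketch, a numpy-style docstring on a hypothetical helper could look like this (the price format is an assumption about the raw data):

def clean_price(price: str) -> float:
    """Convert a raw price string to a float.

    Parameters
    ----------
    price : str
        The price as it appears in the raw data, e.g. "$1,200.00"

    Returns
    -------
    float
        The price as a number, e.g. 1200.0
    """
    return float(price.replace("$", "").replace(",", ""))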
Allows for writing LaTeX math equations and having them render properly
Lets us refer to other projects that use sphinx in our documentation
Need to add a section to conf.py
mapping a name to the external project
intersphinx_mapping = {
    "pd": ("https://pandas.pydata.org/pandas-docs/stable/", None),
    "sklearn": ("https://scikit-learn.org/stable/", None),
}
Now I can refer to the pandas documentation with :class:`pd.DataFrame` and sphinx will auto-link to the pandas documentation
Test the code snippets in your documentation. This adds some new directives, chief among them doctest
.. doctest::

   >>> print(2 + 2)
   4
Before we can do EDA - we should download the data and explore it
We want the detailed listing data for Copenhagen - find it here: http://insideairbnb.com/get-the-data.html
The data is updated monthly - write a function to download the data given a day, month and year
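
A sketch of such a function - the URL template below is an assumption, so verify it against the actual download links on the get-the-data page:

import pathlib
import urllib.request

from price_forecaster.config import RAW_DIR

# Assumed URL pattern for the Copenhagen listings - check insideairbnb.com for the real one
URL_TEMPLATE = (
    "http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/"
    "{year}-{month:02d}-{day:02d}/data/listings.csv.gz"
)

def download_listings(day: int, month: int, year: int) -> pathlib.Path:
    """Download the detailed listings data for a given scrape date into data/raw."""
    url = URL_TEMPLATE.format(year=year, month=month, day=day)
    target = RAW_DIR / f"listings_{year}-{month:02d}-{day:02d}.csv.gz"
    urllib.request.urlretrieve(url, target)
    return target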
Spend some time exploring the data - Write it down in the sphinx documentation!
We are working with a file-based dataset, so we should use a FileDataset to organize our data loading.
We can also extend the FileDataset to handle the downloading and preprocessing of the data.
Identifying the correct datatypes and passing them to pd.read_csv can save a lot of time and memory.
File formats such as parquet store the datatypes as metadata, but CSVs are just text.
Remember, categorical and string are both datatypes in pandas! Check the documentation!
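
A sketch of what that could look like - the column names and dtypes here are assumptions, so adjust them to the columns you actually find:

import pandas as pd

from price_forecaster.config import RAW_DIR

# Assumed columns for illustration - inspect the real data first
dtypes = {
    "id": "int64",
    "neighbourhood_cleansed": "category",
    "room_type": "category",
    "name": "string",
}

listings = pd.read_csv(
    RAW_DIR / "listings.csv.gz",
    dtype=dtypes,
    parse_dates=["last_review"],
)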
During the EDA process, you should have been thinking about potential preprocessing steps that need to happen.
You should also have some ideas for feature engineering that could be beneficial.
Now that we have some preprocessing code and data loading code, we should start writing tests
We use pytest - a common choice in modern python
Pytest lets us write simple asserts in our test functions and handles running the tests
pytest looks for functions named test_* in files named test_* and runs those
We add our test function to the tests folder in a file named test_add.py so pytest will find it
def my_add_function(a, b):
    return a + b

def test_add():
    result = my_add_function(2, 2)
    assert result == 4
To run pytest at the command line, just tell it where the tests are
$ pytest ./tests
pytest has a concept of parametrization that allows us to test many inputs at once
import pytest

@pytest.mark.parametrize("a, b, expected", [
    (1, 2, 3),
    (2, 2, 4),
    (-2, 2, 0),
])
def test_add(a, b, expected):
    result = my_add_function(a, b)
    assert result == expected
import pandas as pd
import pytest

# This will run before every test that uses this fixture
@pytest.fixture()
def my_df():
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# my_df is now the return value of the fixture
def test_df_func_works(my_df):
    result = my_df_func(my_df)
    expected = pd.DataFrame({"a": [6], "b": [15]})
    # Pandas has some testing functions to help assert
    pd.testing.assert_frame_equal(result, expected)
A good test pattern is writing a test class for a given piece of functionality
That class can then contain the test code and fixtures related to that functionality
import pandas as pd
import pytest

class TestDataPipe:
    # scope defines how often the fixture is rerun
    @pytest.fixture(scope="class")
    def df(self):
        return pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})

    # scope="class" means once per class
    @pytest.fixture(scope="class")
    def transformed(self, df):
        return pipe(df)

    # Test one thing per test
    def test_adds_col_b_correctly(self, transformed):
        assert transformed.b == 12

    # If it fails, we know exactly what is wrong
    def test_has_correct_name(self, transformed):
        assert transformed.name == "sum"
Write some tests!
Coverage shows the percentage of lines that have been executed by a test.
It indicates areas that have not had tests written and where we should focus our attention
To add coverage, we can use a pytest plugin called pytest-cov
$ poetry add --dev pytest-cov
Now we can get test coverage
$ pytest --cov=src ./tests
pre-commit lets us run hooks on every git commit
The config file must be named .pre-commit-config.yaml
repos:
  # Where to find the hooks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    # What version
    rev: v2.3.0
    # Which hooks to use in that repo
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
Install the hooks
$ pre-commit install
All the checks will now run on every git commit on the changed files
We can also run on all our files - useful for CI
$ pre-commit run --all-files
Tox is designed as a test-running framework, but also works as a generic command runner
Tox can run arbitrary commands with separated dependencies
Tox is configured in the tox.ini file - this is where we define our environments
[tox]
envlist = what envs will run if I just write `tox` without args
[testenv]
# A special env - will be run if we specify a python version
# Example: `tox -e py38` will run `testenv` with python 3.8
# `py` will run this env with just `python`
[testenv:name_of_env]
# The rest of the envs have a given name
deps = List of things to pip install
commands = command to run
skip_install = whether or not to skip installing your package
[tox]
envlist = black
[testenv:black]
skip_install = True
deps = black
commands = black .
Use passenv or setenv if you need to keep some environment variables
Run a specific environment with -e, e.g. tox -e black
Let’s set up our tox file to run our tests and some utilities
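
A sketch of what that could look like - the environment names and the coverage flag are choices, not requirements:

[tox]
envlist = py38, black, flake8
# Needed so tox can build the package via the poetry build backend
isolated_build = True

[testenv]
deps =
    pytest
    pytest-cov
commands = pytest --cov=src tests

[testenv:black]
skip_install = True
deps = black
commands = black --check .

[testenv:flake8]
skip_install = True
deps = flake8
commands = flake8 src tests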
Since tox is a very common framework, many tools will look for configuration in the tox.ini
For details, check the tool's documentation (e.g. pytest)
This lets us have one file with all our configuration in it, instead of one per tool
Let’s look at some common recommendations for configuration
Tell pytest to look in the tests folder for tests with testpaths
addopts adds commandline flags automatically to the pytest command (list here)
-ra means "generate a short report of all tests that didn't pass"
-l means "show local variables in the failed tests"

[pytest]
addopts = -ra -l
testpaths = tests
filterwarnings =
    once::Warning
If you have implemented pytest + coverage in your tox file, you might have noticed that the test paths look a bit strange.
Tox runs your code in a virtual environment, so the path to the code is inside that virtualenv!
We configure coverage to treat the tox paths the same as our src folder
[coverage:paths]
source =
    src
    .tox/*/site-packages
ignore = <error code> - only if you really need to!
max-line-length - black uses 88 as its line length
Exclude folders like .tox or venv - if you don't, flake8 will go through everything

[flake8]
max-line-length = 100
exclude =
    .git
    build
    dist
    notebooks
    __pycache__
    .tox
    .pytest_cache
    *.egg-info
Writing a custom transformer is not hard - we just need to follow the scikit-learn recipe!
from sklearn.base import BaseEstimator, TransformerMixin

class MyCustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter_1=1, parameter_2=2):
        self.parameter_1 = parameter_1
        self.parameter_2 = parameter_2
We inherit from BaseEstimator and TransformerMixin
__init__ should only set parameters

    def fit(self, x, y=None):
        # validate_data and my_func stand in for your own validation and fitting logic
        validate_data(x)
        self.learned_parameter_ = my_func(x)
        return self
fit usually only needs X, but some transformers also need y - so default it to None

    def transform(self, x, y=None):
        # We often want to make sure we don't modify the original data
        x_ = x.copy()
        # Transform the data using learned parameters
        new_data = my_transform_func(x_, self.learned_parameter_)
        return new_data

We don't want to modify x directly

Training a model in Jupyter is useful, but we often want to be able to train a model from the command line
Then we can automate the retraining pretty easily
We need a CLI!
Click is a great package for adding a CLI to our model - see the docs
import click

# model and dataset are assumed to come from elsewhere in our project
@click.command()
@click.option("--cv", default=10, type=int, help="Number of cross-validations to do")
def train(cv: int):
    click.secho(f"Training with {cv} splits...", fg="green")
    result = model.score_estimator(dataset, cv=cv)
    click.echo(result)

if __name__ == "__main__":
    train()
Now we can call it from the command line - assuming it’s in a file called main.py
$ python main.py --cv 5
Training with 5 splits...
<Result metrics={"accuracy": .98}>
We can create a “proper” CLI by adding a line to pyproject.toml
[tool.poetry.scripts]
my_cli = 'my_model.main:train'
After running poetry install
we can now do
$ my_cli --cv 5
Training with 5 splits...
<Result metrics={"accuracy": .98}>
If we want to have subcommands - e.g. train, publish, test etc. - we need a click.group
import click

@click.group()
def cli():
    pass

@cli.command()
def train():
    ...  # train code

@cli.command()
def publish():
    ...  # publish code

if __name__ == "__main__":
    cli()
Now we can do something like this.
$ my_cli train --cv 5
// This is Scala code
var s: String = "abc"
// This will error
s = 1

Python is dynamically typed - rebinding a name to a different type is fine:

a = "abc"
a = 1

But type errors only show up when the code runs:

a = "abc"
a = 1
print(a.upper())

These errors will throw an exception at runtime
The typing module allows us to annotate our types
mypy is a CLI tool that verifies if there are any typing issues

a: str = "abc"
b: int = 1
def my_func(a: str, b: int) -> str:
    ...

from typing import List, Tuple

def process_ints(list_of_ints: List[int]) -> Tuple[int, int]:
    ...

class A:
    ...

class B:
    ...

def process_a_and_b(a: A, b: B) -> int:
    ...
We can Union types to say “either or”
from typing import Union

import pandas as pd

def handle_data(c: Union[pd.DataFrame, pd.Series]):
    ...
Duck typing can be implemented with a Protocol

from typing import List, Protocol

class GreetType(Protocol):
    def greet(self) -> None:
        ...  # This is a literal ellipsis -> "..."

class Greeter:
    def greet(self) -> None:
        print("hello")

# Accepts any type that has a `greet` method
def greet_everyone(to_greet: List[GreetType]):
    ...
There are a number of built-in Protocols - see them here