Setting up an ML Project

The Project

The Scenario

  • We are a new Copenhagen-based startup, homehelper.dk
  • We are a fully managed AirBnB service for people who want to rent out their apartment on AirBnB
  • We offer cleaning services, pictures, key exchange and everything else a potential renter might need

The Product

We want to build a feature where a potential host can answer a few questions about their apartment and get an indication of how much money they can expect to make (minus our fee, of course!)

The data

We are a startup, so we don’t have much data. Luckily, we found a website that scrapes AirBnB data!

http://insideairbnb.com/

Starting a new project

Install pipx

https://github.com/pipxproject/pipx

Use pipx to install cookiecutter

$ pipx install cookiecutter

Cookiecutter Data Science

We are going to base our project on the Data Science cookiecutter

https://drivendata.github.io/cookiecutter-data-science/

$ cookiecutter https://github.com/drivendata/cookiecutter-data-science

project_name [project_name]: price_forecaster
repo_name [price_forecaster]:
author_name [Your name (or your organization/company/team)]: Anders Bogsnes
description [A short description of the project.]: Predicting Airbnb Prices
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: 1

The structure

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creators initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

A few changes

The rest is up to you to decide as we work through the project

Setup our src

$ cd price_forecaster/src
$ mkdir price_forecaster
# Ignore the error - everything should have worked
$ mv * price_forecaster
# Get back to our project root
$ cd ..

Poetry + pyenv for package management

Install poetry

https://python-poetry.org/docs

Install pyenv

https://github.com/pyenv/pyenv-installer

Setup your environment

$ pyenv install 3.8.5

$ pyenv virtualenv 3.8.5 price_forecaster

$ pyenv local price_forecaster
$ poetry init
Package name [price_forecaster]:  
Version [0.1.0]:  
Description []:  Airbnb price forecaster
Author [Anders Bogsnes <andersbogsnes@gmail.com>, n to skip]:  
License []:  MIT
Compatible Python versions [^3.8]:  

Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no
Generated file

[tool.poetry]
name = "price_forecaster"
version = "0.1.0"
description = "Airbnb price forecaster"
authors = ["Anders Bogsnes <andersbogsnes@gmail.com>"]
license = "MIT"

[tool.poetry.dependencies]
python = "^3.8"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"


Do you confirm generation? (yes/no) [yes] 

Poetry

Poetry is a relatively new dependency manager that takes advantage of the pyproject.toml file (PEP 518 & PEP 517)

Project deps vs Dev-deps

Poetry differentiates between dev-dependencies and runtime-dependencies

$ poetry add pandas

$ poetry add --dev pytest

Lockfile

  • Dependencies are split into pyproject.toml and poetry.lock
  • Your direct dependencies are saved in pyproject.toml
  • The exact state of your environment is saved in poetry.lock

Install

  • poetry install will install your pyproject.toml
  • It also installs your package in editable mode (pip install -e .) automatically

Envs

  • Poetry will always install in a virtualenv
  • If you are already in one, it will install there
  • If you are not, it will automatically generate one and install there

Assignment

  • Setup your project with cookiecutter, poetry and git
  • Create a gitlab repo for your project

Create a configuration file

We want to be able to reference our data folders - so let’s have a config file to import those paths from

import pathlib

# This depends on where the config file is
ROOT_DIR = pathlib.Path(__file__).parents[2]
DATA_DIR = ROOT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
INTERIM_DIR = DATA_DIR / "interim"
PROCESSED_DIR = DATA_DIR / "processed"

Start documenting

Sphinx

Sphinx is a documentation compiler - it takes your files and documents and compiles them into a target format - usually HTML

Exploring Sphinx

The cookiecutter comes with a basic sphinx setup in the docs folder.

  • Install sphinx as a dev-dependency
  • navigate to docs folder
  • run make html
  • Open ./_build/html/index.html
  • Compare the .rst files in the docs folder to the html

Change the theme

We can configure sphinx in the conf.py file in docs.

We can change the theme by finding the html_theme variable and setting it to the theme we want.

Exercise

.rst syntax

Sphinx uses reStructuredText as its markup language.

Read the basics here:

https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html

Directives

Sphinx and .rst have many directives that let sphinx know you want to do something.

A directive looks like this:

.. my_directive::
    :arg1: 1
    :arg2: 2

    Content given to the directive

toctree

  • The first directive we will be looking at is toctree

  • toctree defines which .rst files are included in the build and should be present somewhere in index.rst

  • It has a number of options

  • Generally, we can list the names of the files we want to include as content
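A minimal toctree might look like this (the file names below are purely illustrative):

```rst
.. toctree::
   :maxdepth: 2

   getting-started
   commands
```

Each entry names an .rst file (without the extension) relative to the file containing the directive.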

Exercise

  • Add a new .rst file features.rst
  • Describe a feature
  • Add features to the toc
  • Rebuild the documentation
  • Read about the other directives

Linking

We can refer to other sections or definitions using the specific role

If we have a label somewhere

.. _my_label:

We can create a link to it

This refers to :ref:`my_label`

(note the backticks)

If we have defined a class documentation, for example, we could link to it

:class:`Foo`

Extensions

Sphinx has a number of extensions we can use - some built-in and some we can install

These are configured in the conf.py file and there are a number of useful ones we should set up

sphinx.ext.autodoc

Extracts documentation from your code docstrings

Usage

This will import the documentation for the my_package.my_module module

.. automodule:: my_package.my_module
    :members:

sphinx.ext.napoleon

We recommend writing docstrings in the “Numpy” docstyle

https://numpydoc.readthedocs.io/en/latest/format.html

This extension lets sphinx parse this style
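As a sketch, a NumPy-style docstring that napoleon can parse looks like this (the function itself is made up for illustration):

```python
def yearly_revenue(nightly_price: float, occupancy_rate: float) -> float:
    """Estimate yearly revenue for a listing.

    Parameters
    ----------
    nightly_price : float
        The price charged per night.
    occupancy_rate : float
        Fraction of nights booked, between 0 and 1.

    Returns
    -------
    float
        Estimated revenue over 365 nights.
    """
    return nightly_price * occupancy_rate * 365
```

With autodoc + napoleon enabled, the Parameters and Returns sections render as nicely formatted field lists.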

sphinx.ext.mathjax

Allows for writing LaTeX math equations and having them render properly

sphinx.ext.intersphinx

Lets us refer to other projects that use sphinx in our documentation

Need to add a section to conf.py mapping a name to the external project

intersphinx_mapping = {
    "pd": ("https://pandas.pydata.org/pandas-docs/stable/", None),
    "sklearn": ("https://scikit-learn.org/stable/", None),
}

Now I can refer to the pandas documentation with :class:`pd.DataFrame` and sphinx will auto-link to the pandas documentation

sphinx.ext.doctest

Test the codesnippets in your documentation. This adds some new directives, chief among them doctest

.. doctest::

    >>> print(2 + 2)
    4

The Data

Download the data

Before we can do EDA - we should download the data and explore it

We want the detailed listing data for Copenhagen - find it here: http://insideairbnb.com/get-the-data.html

Exercise

The data is updated monthly - write a function to download the data given a day, month and year

(A) Solution
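One possible solution, assuming Inside Airbnb's URL scheme of `http://data.insideairbnb.com/<country>/<region>/<city>/<date>/data/listings.csv.gz` (the exact layout may change, so verify against the site):

```python
import pathlib
import urllib.request

# Assumed base URL for the Copenhagen listings - check insideairbnb.com
BASE_URL = "http://data.insideairbnb.com/denmark/hovedstaden/copenhagen"


def build_url(year: int, month: int, day: int) -> str:
    """Build the listings URL for a given scrape date."""
    return f"{BASE_URL}/{year}-{month:02d}-{day:02d}/data/listings.csv.gz"


def download_data(year: int, month: int, day: int, dest: pathlib.Path) -> pathlib.Path:
    """Download the listings file into dest, skipping if it already exists."""
    target = dest / f"listings_{year}-{month:02d}-{day:02d}.csv.gz"
    if not target.exists():
        urllib.request.urlretrieve(build_url(year, month, day), target)
    return target
```

Saving into `data/raw` (via the config file's RAW_DIR) keeps the original download immutable, as the cookiecutter structure intends.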

Explore the data

Spend some time exploring the data - Write it down in the sphinx documentation!

  • What features look promising?
  • What dtypes are the features?
  • Is there missing data?
  • Is it high cardinality?
  • Ideas for preprocessing?
  • Hypotheses?
  • What can we throw out?
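A few pandas one-liners answer most of these questions - a sketch on a toy frame (the column names are invented, not the real listing columns):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": ["$100.00", "$250.00", None],
        "room_type": ["Entire home/apt", "Private room", "Private room"],
    }
)

# dtypes and non-null counts in one view
df.info()

# Missing data per column
print(df.isna().sum())

# Cardinality - how many unique values per column?
print(df.nunique())
```

High-cardinality text columns and currency strings like "$100.00" are exactly the kind of findings worth writing down in the sphinx docs.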

Preprocessing

Building the dataset

We are working with a file-based dataset, so we should use a FileDataset to organize our dataloading.

We can also extend the FileDataset to handle the downloading and preprocessing of the data.

Identify datatypes to read

Identifying the correct datatypes and passing them to pd.read_csv can save a lot of time and memory.

File formats such as parquet store the datatypes as metadata, but CSVs are plain text.

Remember, categorical and string are both datatypes in pandas! Check the documentation!
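As a sketch, passing a dtype mapping to pd.read_csv (the column names here are invented stand-ins for the real listing columns):

```python
import io

import pandas as pd

# Stand-in for the real listings.csv
csv = io.StringIO("neighbourhood,room_type,accommodates\nNørrebro,Private room,2\n")

dtypes = {
    "neighbourhood": "category",  # low-cardinality text -> category saves memory
    "room_type": "category",
    "accommodates": "int16",      # small integer range -> no need for int64
}

df = pd.read_csv(csv, dtype=dtypes)
print(df.dtypes)
```

Doing this once at read time beats converting columns one by one after the fact.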

Identify preprocessing steps

During the EDA process, you should have been thinking about potential preprocessing steps that need to happen.

You should also have some ideas for feature engineering that could be beneficial.

Exercise

  • Write a PriceForecasterDataset class inheriting from FileDataset
  • Extend it with the download data functionality from before
  • Extend it with the data preprocessing you’ve identified

Testing

Writing a test

Now that we have some preprocessing code and data loading code, we should start writing tests

Pytest

We use pytest - a common choice in modern Python

pytest lets us write plain asserts in our test functions and handles running the tests for us

pytest looks for functions named test_* in files named test_* and runs those

A simple test

We need to add our test functions to the tests folder in a file named test_add.py so pytest will find them

def my_add_function(a, b):
    return a + b

def test_add():
    result = my_add_function(2, 2)
    assert result == 4

Running pytest

To run pytest at the command line, just tell it where the tests are

$ pytest ./tests

Parametrization

pytest has a concept of parametrization that allows us to test many inputs at once

import pytest

@pytest.mark.parametrize("a, b, expected", [
    (1, 2, 3),
    (2, 2, 4),
    (-2, 2, 0)
])
def test_add(a, b, expected):
    result = my_add_function(a, b)
    assert result == expected

Fixtures

  • A fixture is some setup code that we want to run before our test
  • It allows us to reuse logic across tests
  • It allows us to create resources, such as a database connection, and then close it when done
# This will run before every test that uses this fixture
@pytest.fixture()
def my_df():
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# my_df is now the return value of the fixture
def test_df_func_works(my_df):
    result = my_df_func(my_df)
    expected = pd.DataFrame({"a": [6], "b": [15]})

    # Pandas has some testing functions to help assert
    pd.testing.assert_frame_equal(result, expected)

A class testing pattern

A good test pattern is writing a test class for a given piece of functionality

That class can then contain the test code and fixtures related to that functionality

import pytest
import pandas as pd

class TestDataPipe:
    # scope defines how often the function is rerun
    @pytest.fixture(scope="class")
    def df(self):
        return pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})

    # Scope = "class" means once per class
    @pytest.fixture(scope="class")
    def transformed(self, df):
        return pipe(df)

    # Test one thing per test
    def test_adds_col_b_correctly(self, transformed):
        assert transformed.b.sum() == 12

    # If it fails, we know exactly what is wrong
    def test_has_correct_name(self, transformed):
        assert transformed.columns[-1] == "sum"

Exercise

Write some tests!

  • Create some mock data
  • transform it and verify the output
  • run pytest on your test code

Coverage

Coverage shows the percentage of lines that have been executed by a test.

It indicates areas that have not had tests written and where we should focus our attention

Adding coverage

To add coverage, we can use a pytest plugin called pytest-cov

$ poetry add --dev pytest-cov

Now we can get test coverage

$ pytest --cov=src ./tests

Pre-commit

What is it?

  • It lets us run quick checks every time we make a commit by using git pre-commit hooks
  • It makes it easy to define and install these little scripts by writing a short config file
  • It ensures that linting and formatting run continuously on your codebase - no more CI failures!

Example file

Must be named .pre-commit-config.yaml

repos:
    # Where to find the hooks
-   repo: https://github.com/pre-commit/pre-commit-hooks
    # What version
    rev: v2.3.0
    # Which hooks to use in that repo
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace
-   repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
    -   id: black

Usage

Install the hooks

$ pre-commit install

All the checks will now run on every git commit on the changed files

We can also run on all our files - useful for CI

$ pre-commit run --all-files

Exercise

  • Check here for some out-of-the-box hooks
  • Check here for hooks for other tools
  • Add some hooks you think could be useful

Tox

What is tox

Tox is designed as a test-running framework, but it also works as a generic command runner

When we run tox

  • tox reads a tox.ini file
  • creates a virtualenv for each environment to be run
  • installs defined dependencies into the virtualenv, including your package
  • runs the defined command

Why use tox

  • For testing, tox runs the tests on the installed package, not your source code!
  • This mimics what your users will do and creates more reliable tests
  • Tox is also designed to test with multiple python versions

Tox can run arbitrary commands with separated dependencies.

  • Want to run sphinx? define an environment
  • Want to run black? define an environment
  • Want to run mypy? define an environment

The tox.ini file

This is where we define our environments


[tox]
envlist = what envs will run if I just write `tox` without args

[testenv]
# A special env - will be run if we specify a python version
# Example: `tox -e py38` will run `testenv` with python 3.8
# `py` will run this env with just `python`

[testenv:name_of_env]
# The rest of the envs have a given name
deps = List of things to pip install
commands = command to run
skip_install = whether or not to skip installing your package

Implementing a black environment

[tox]
envlist = black

[testenv:black]
skip_install = True
deps = black
commands = black .

Points of note

  • tox will remove all environment variables when running - use passenv or setenv if you need to keep some
  • tox is installing in a separate environment - you need to specify all dependencies
  • Run a specific environment with -e e.g tox -e black

Exercise

Let’s set up our tox file to run our tests and some utilities

  • Add the testing to tox
  • Add pre-commit
  • Add a linter, like flake8
  • BONUS POINTS: check the flake8 docs for adding some flake8 plugins
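A possible starting point for the exercise (environment names and dependency choices are illustrative, not prescriptive):

```ini
[tox]
envlist = py38, flake8, pre-commit

[testenv]
deps =
    pytest
    pytest-cov
commands = pytest --cov=src tests

[testenv:flake8]
skip_install = True
deps = flake8
commands = flake8 src tests

[testenv:pre-commit]
skip_install = True
deps = pre-commit
commands = pre-commit run --all-files
```

Running plain `tox` now executes all three environments; `tox -e flake8` runs just the linter.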

Configuration

Since tox is a very common framework, many tools will look for configuration in the tox.ini

For details, check the tool's documentation (e.g. pytest)

This lets us have one file with all our configuration in it, instead of one per tool

Let’s look at some common recommendations for configuration

Pytest

  • Automatically look in the tests folder for tests with testpaths
  • addopts adds commandline flags automatically to the pytest command (list here)
    • -ra means “generate short report of all tests that didn’t pass”
    • -l means “show local variables in the failed tests”
[pytest]
addopts = -ra -l
testpaths = tests
filterwarnings =
    once::Warning

Tox + Coverage

If you have implemented pytest + coverage in your tox file, you might have noticed that the test paths look a bit strange.

Tox runs your code in a virtual environment, so the path to the code is inside that virtualenv!

We configure coverage to treat tox paths the same as our src

[coverage:paths]
source =
    src
    .tox/*/site-packages

Flake8

  • Control flake8 ignores with ignore = <error code> - only if you really need to!
  • Set the max line length higher than 79 - black uses 88
  • Exclude directories like .tox or venv - otherwise flake8 will go through everything
[flake8]
max-line-length = 100
exclude =
    .git
    build
    dist
    notebooks
    __pycache__
    .tox
    .pytest_cache
    *.egg-info

Pipelines & Transformers

Writing a custom Transformer

Writing a custom transformer is not hard - we just need to follow the scikit-learn recipe!

The template

class MyCustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter_1=1, parameter_2=2):
        self.parameter_1 = parameter_1
        self.parameter_2 = parameter_2
  • Must inherit from BaseEstimator and TransformerMixin
  • The __init__ should only set parameters
  • It should not have any logic inside
  • All parameters must have default values
    def fit(self, x, y=None):
        validate_data(x)
        self.learned_parameter_ = my_func(x)
        return self
  • The “learning” of parameters from training data happens here
  • Any validation of data/parameters goes in here
  • Scikit-learn convention is that “learned” attributes have “_” after
  • Most transformers work only with X, but some need y - default to None
  • Returns self for chaining
    def transform(self, x, y=None):
        # We often want to make sure we don't modify the original data
        x_ = x.copy()
        # Transform the data using learned parameters
        new_data = my_transform_func(x_, self.learned_parameter_)
        return new_data
  • Transform the data using the learned parameter
  • Make sure not to modify x directly
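Putting the recipe together, a small (hypothetical) transformer that fills missing values with the column means learned during fit:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, copy=True):
        # Only store parameters - no logic in __init__
        self.copy = copy

    def fit(self, x, y=None):
        x = np.asarray(x, dtype=float)
        # "Learned" attributes get a trailing underscore
        self.means_ = np.nanmean(x, axis=0)
        return self

    def transform(self, x, y=None):
        x_ = np.asarray(x, dtype=float)
        if self.copy:
            x_ = x_.copy()
        # Replace each NaN with the mean learned for its column
        rows, cols = np.where(np.isnan(x_))
        x_[rows, cols] = self.means_[cols]
        return x_
```

Because it follows the recipe, it can be dropped straight into a Pipeline and cross-validated like any built-in transformer.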

Adding a CLI (Command Line Interface)

Running our model from the command line

Training a model in Jupyter is useful, but we often want to be able to train a model from the command line

Then we can automate the retraining pretty easily

We need a CLI!

Click

Click is a great package for adding a CLI to our model - see the docs

import click

@click.command()
@click.option("--cv", default=10, type=int, help="Number of cross-validations to do")
def train(cv: int):
    click.secho(f"Training with {cv} splits...", fg="green")
    result = model.score_estimator(dataset, cv=cv)
    click.echo(result)

if __name__ == "__main__":
    train()

Now we can call it from the command line - assuming it’s in a file called main.py

$ python main.py --cv 5
Training with 5 splits...
<Result metrics={"accuracy": .98}>

We can create a “proper” CLI by adding a line to pyproject.toml

[tool.poetry.scripts]
my_cli = 'my_model.main:train'

After running poetry install we can now do

$ my_cli --cv 5
Training with 5 splits...
<Result metrics={"accuracy": .98}>

Multiple commands

If we want to have subcommands - e.g. train, publish, test, etc. - we need a click.group

import click

@click.group()
def cli():
    pass

@cli.command()
def train():
    ...  # train code

@cli.command()
def publish():
    ...  # publish code

if __name__ == "__main__":
    cli()

Now we can do something like this.

$ my_cli train --cv 5

Type Hinting

Static Typing

  • Other languages have a concept of static typing
  • When declaring a variable, we must declare a type as well and that type does not change
// This is Scala code
var s: String = "abc"
// This will error
s = 1

Dynamic Typing

  • Python is dynamically typed
  • We can reassign variables as we want
a = "abc"
a = 1

Advantages of Static typing

  • Static typing makes it easy to verify the correctness of our program
a = "abc"
a = 1
print(a.upper())

In Python, these errors only surface as exceptions at runtime - a static type checker would catch them before the program runs

Advantages of Dynamic typing

  • Part of the simplicity of Python
  • We might not know up-front what types we can work with
  • “Duck typing”

The best of both worlds?

  • Python 3.5 introduced the typing module, allowing us to annotate our types
  • Annotated types have no effect at runtime - they help tools verify your code
  • Pycharm uses typehints to warn you if your types are incompatible
  • mypy is a CLI tool that verifies if there are any typing issues

How to type

a: str = "abc"
b: int = 1

def my_func(a: str, b: int) -> str:
    ...

The typing module

from typing import List, Tuple

def process_ints(list_of_ints: List[int]) -> Tuple[int, int]:
    ...

Classes are types too


class A:
    ...

class B:
    ...

def process_a_and_b(a: A, b: B) -> int:
    ...

Multiple Types

We can Union types to say “either or”

from typing import Union

def handle_data(c: Union[pd.DataFrame, pd.Series]):
    ...

Supporting Ducktyping

Duck typing can be expressed with a Protocol

from typing import List, Protocol

class GreetType(Protocol):
    def greet(self) -> None:
        ...  # This is a literal ellipsis -> "..."


class Greeter:
    def greet(self) -> None:
        print("hello")

# Accepts any type that has a `greet` method
def greet_everyone(to_greet: List[GreetType]):
    ...

There are a number of built-in Protocols - see them here

Implement Types

  • Typing helps you document your code
  • Typing helps catch errors before they blow up
  • Typing makes Pycharm smarter