Setting up an ML Project

The Project

The Scenario

  • We are a new Copenhagen-based startup, homehelper.dk
  • We are a fully managed AirBnB service for people who want to rent out their apartment on AirBnB
  • We offer cleaning services, pictures, key exchange and everything else a potential renter might need

The Product

We want to build a feature where a potential renter can answer a few questions about their apartment and get an indication of how much money they can expect to make (minus our fee, of course!)

The Data

We are a startup, so we don’t have much data. Luckily, we found a website that scrapes AirBnB data!

http://insideairbnb.com/

Starting a new project

Install pipx

https://github.com/pipxproject/pipx

Use pipx to install cookiecutter

$ pipx install cookiecutter

Cookiecutter Data Science

We are going to base our project on the Data Science cookiecutter

https://drivendata.github.io/cookiecutter-data-science/

$ cookiecutter https://github.com/drivendata/cookiecutter-data-science

project_name [project_name]: price_forecaster
repo_name [price_forecaster]:
author_name [Your name (or your organization/company/team)]: Anders Bogsnes
description [A short description of the project.]: Predicting Airbnb Prices
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: 1

The structure

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

A few changes

The rest is up to you to decide as we work through the project

Set up our src

$ cd price_forecaster/src
$ mkdir price_forecaster
$ mv * price_forecaster
# mv complains it can't move price_forecaster into itself - ignore that, everything else was moved
# Get back to our project root
$ cd ..

Poetry + pyenv for package management

Install poetry

https://python-poetry.org/docs

Install pyenv

https://github.com/pyenv/pyenv-installer

Set up your environment

$ pyenv install 3.8.5

$ pyenv virtualenv 3.8.5 price_forecaster

$ pyenv local price_forecaster
$ poetry init
Package name [price_forecaster]:  
Version [0.1.0]:  
Description []:  Airbnb price forecaster
Author [Anders Bogsnes <andersbogsnes@gmail.com>, n to skip]:  
License []:  MIT
Compatible Python versions [^3.8]:  

Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no
Generated file

[tool.poetry]
name = "price_forecaster"
version = "0.1.0"
description = "Airbnb price forecaster"
authors = ["Anders Bogsnes <andersbogsnes@gmail.com>"]
license = "MIT"

[tool.poetry.dependencies]
python = "^3.8"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"


Do you confirm generation? (yes/no) [yes] 

Poetry

Poetry is a relatively new dependency manager that takes advantage of the pyproject.toml file (PEP 518 & PEP 517)

Project deps vs Dev-deps

Poetry differentiates between dev-dependencies and runtime-dependencies

$ poetry add pandas

$ poetry add --dev pytest

Lockfile

  • Dependencies are split into pyproject.toml and poetry.lock
  • Your direct dependencies are saved in pyproject.toml
  • The exact state of your environment is saved in poetry.lock

Install

  • poetry install will install the dependencies declared in your pyproject.toml
  • It also installs your package in editable mode (-e .) automatically

Envs

  • Poetry will always install in a virtualenv
  • If you are already in one, it will install there
  • If you are not, it will automatically generate one and install there

Assignment

  • Setup your project with cookiecutter, poetry and git
  • Create a gitlab repo for your project

Create a configuration file

We want to be able to reference our data folders - so let’s have a config file to import those paths from

import pathlib

# parents[2] points to the project root, assuming this file lives at src/price_forecaster/config.py
ROOT_DIR = pathlib.Path(__file__).parents[2]
DATA_DIR = ROOT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
INTERIM_DIR = DATA_DIR / "interim"
PROCESSED_DIR = DATA_DIR / "processed"
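
Elsewhere in the project we can then import these paths instead of hard-coding them - for example (assuming the config module lives at src/price_forecaster/config.py and the raw listings file has already been downloaded; the filename is illustrative):

import pandas as pd

from price_forecaster.config import RAW_DIR

# Example filename - use whatever you saved the raw download as
listings = pd.read_csv(RAW_DIR / "listings.csv.gz")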

Start documenting

Sphinx

Sphinx is a documentation compiler - it takes your source files and compiles them into a target format, usually HTML

Exploring Sphinx

The cookiecutter comes with a basic sphinx setup in the docs folder.

  • Install sphinx as a dev-dependency
  • Navigate to the docs folder
  • Run make html
  • Open ./_build/html/index.html
  • Compare the .rst files in the docs folder to the html

Change the theme

We can configure sphinx in the conf.py file in docs.

We can change the theme by finding the html_theme variable and setting it to the theme we want.
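
For example, to switch to the Read the Docs theme (assuming sphinx_rtd_theme has been added as a dev-dependency):

# docs/conf.py
html_theme = "sphinx_rtd_theme"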

Exercise

.rst syntax

Sphinx uses reStructuredText as its markup language.

Read the basics here:

https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html

Directives

Sphinx and .rst have many directives that tell Sphinx you want it to do something.

A directive looks like this:

.. my_directive::
    :arg1: 1
    :arg2: 2

    Content given to the directive

toctree

  • The first directive we will be looking at is toctree

  • toctree defines which .rst files are included in the build and should appear somewhere in index.rst

  • It has a number of options

  • Generally, we list the names of the files we want to include as the directive's content, as shown below
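
A minimal toctree in index.rst could look like this (the listed file names are just examples):

.. toctree::
    :maxdepth: 2

    getting-started
    features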

Exercise

  • Add a new .rst file features.rst
  • Describe a feature
  • Add features to the toc
  • Rebuild the documentation
  • Read about the other directives

Linking

We can refer to other sections or definitions using the appropriate role

If we have a label somewhere

.. _my_label:

We can create a link to it

This refers to :ref:`my_label`

(note the backticks)

If we have documented a class, for example, we can link to it with the class role

:class:`Foo`

Extensions

Sphinx has a number of extensions we can use - some built-in and some we can install

These are configured in the conf.py file and there are a number of useful ones we should set up

sphinx.ext.autodoc

Extracts documentation from your code docstrings

Usage

This will import the documentation for the my_package.my_module module

.. automodule:: my_package.my_module
    :members:

sphinx.ext.napoleon

We recommend writing docstrings in the “Numpy” docstyle

https://numpydoc.readthedocs.io/en/latest/format.html

This extension lets Sphinx parse that style

sphinx.ext.mathjax

Allows writing LaTeX math equations and having them render properly

sphinx.ext.intersphinx

Lets us link to other Sphinx-based projects from our documentation

We need to add a mapping in conf.py from a name to the external project's documentation

intersphinx_mapping = {
    "pd": ("https://pandas.pydata.org/pandas-docs/stable/", None),
    "sklearn": ("https://scikit-learn.org/stable/", None),
}

Now I can refer to :class:`pd.DataFrame` and Sphinx will auto-link to the pandas documentation

sphinx.ext.doctest

Tests the code snippets in your documentation. It adds some new directives, chief among them doctest

.. doctest::

    >>> print(2 + 2)
    4

The Data

Download the data

Before we can do any EDA, we need to download the data

We want the detailed listing data for Copenhagen - find it here: http://insideairbnb.com/get-the-data.html

Exercise

The data is updated monthly - write a function to download the data given a day, month and year

A Solution
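
One possible sketch (the URL pattern is an assumption about how Inside Airbnb publishes its Copenhagen archives - verify it against the actual download links on the site):

import pathlib

import requests

from price_forecaster.config import RAW_DIR

# Assumed URL pattern - check against the links on insideairbnb.com
URL_TEMPLATE = (
    "http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/"
    "{year}-{month:02d}-{day:02d}/data/listings.csv.gz"
)


def download_data(day: int, month: int, year: int) -> pathlib.Path:
    """Download the detailed listings data for a given scrape date into data/raw"""
    url = URL_TEMPLATE.format(year=year, month=month, day=day)
    target = RAW_DIR / f"listings_{year}-{month:02d}-{day:02d}.csv.gz"
    response = requests.get(url)
    response.raise_for_status()
    target.write_bytes(response.content)
    return target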

Explore the data

Spend some time exploring the data - write down what you find in the Sphinx documentation!

  • What features look promising?
  • What dtypes are the features?
  • Is there missing data?
  • Are any features high-cardinality?
  • Ideas for preprocessing?
  • Hypotheses?
  • What can we throw out?

Preprocessing

Building the dataset

We are working with a file-based dataset, so we should use a FileDataset to organize our data loading.

We can also extend the FileDataset to handle the downloading and preprocessing of the data.

Identify datatypes to read

Identifying the correct datatypes and passing them to pd.read_csv can save a lot of time and memory.

File formats such as Parquet store the datatypes as metadata, but CSVs are just text.

Remember, categorical and string are both datatypes in pandas - check the documentation!
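
For example, we could pass the dtypes we identified to pd.read_csv up front (the column names here are illustrative - use the ones you found during EDA):

import pandas as pd

# Example dtype mapping - adjust to the columns in the listings file
dtypes = {
    "room_type": "category",
    "neighbourhood_cleansed": "category",
    "name": "string",
}

df = pd.read_csv("data/raw/listings.csv.gz", dtype=dtypes)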

Identify preprocessing steps

During the EDA process, you should have been thinking about potential preprocessing steps that need to happen.

You should also have some ideas for feature engineering that could be beneficial.

Exercise

  • Write a PriceForecasterDataset class inheriting from FileDataset
  • Extend it with the download data functionality from before
  • Extend it with the data preprocessing you’ve identified
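
A possible skeleton (a sketch - this assumes FileDataset is ml_tooling's FileDataset, which expects load_training_data and load_prediction_data to be implemented; adjust to whatever base class you are actually using):

from ml_tooling.data import FileDataset  # assumed import - adjust to your setup


class PriceForecasterDataset(FileDataset):
    def load_training_data(self):
        # Download the raw file if it's missing, read it with the dtypes from before,
        # apply the preprocessing identified during EDA and return (features, target)
        ...

    def load_prediction_data(self, *args, **kwargs):
        # Return the features for the listing(s) we want a price prediction for
        ...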

Testing

Writing a test

Now that we have some preprocessing code and data loading code, we should start writing tests

Pytest

We use pytest - a common choice in modern Python

Pytest lets us write plain asserts in our test functions and handles running the tests for us

pytest looks for any functions named test_* in files named test_* and runs those

A simple test

We need to add our test functions to the tests folder in a file named test_add.py so pytest will find them

def my_add_function(a, b):
    return a + b


def test_add():
    result = my_add_function(2, 2)
    assert result == 4

Running pytest

To run pytest at the command line, just tell it where the tests are

$ pytest ./tests

Parametrization

pytest has a concept of parametrization that allows us to test many inputs at once

import pytest


@pytest.mark.parametrize("a, b, expected", [
    (1, 2, 3),
    (2, 2, 4),
    (-2, 2, 0)
])
def test_add(a, b, expected):
    result = my_add_function(a, b)
    assert result == expected

Fixtures

  • A fixture is some setup code that we want to run before our test
  • It allows us to reuse logic across tests
  • It allows us to create resources, such as a database connection, and then close it when done

import pandas as pd
import pytest


# This will run before every test that uses this fixture
@pytest.fixture()
def my_df():
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


# my_df is now the return value of the fixture
def test_df_func_works(my_df):
    result = my_df_func(my_df)
    expected = pd.DataFrame({"a": [6], "b": [15]})

    # Pandas has some testing functions to help assert
    pd.testing.assert_frame_equal(result, expected)

A class testing pattern

A good test pattern is writing a test class for a given piece of functionality

That class can then contain the test code and fixtures related to that functionality

import pytest
import pandas as pd


class TestDataPipe:
    # scope defines how often the fixture is rerun
    @pytest.fixture(scope="class")
    def df(self):
        return pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})

    # scope="class" means once per class
    @pytest.fixture(scope="class")
    def transformed(self, df):
        # pipe is the data pipeline under test - here assumed to return df.sum() renamed to "sum"
        return pipe(df)

    # Test one thing per test
    def test_adds_col_b_correctly(self, transformed):
        assert transformed.b == 12

    # If it fails, we know exactly what is wrong
    def test_has_correct_name(self, transformed):
        assert transformed.name == "sum"

Exercise

Write some tests!

  • Create some mock data
  • transform it and verify the output
  • run pytest on your test code

Coverage

Coverage shows the percentage of lines that have been executed by a test.

It indicates areas that have not had tests written and where we should focus our attention

Adding coverage

To add coverage, we can use a pytest plugin called pytest-cov

$ poetry add --dev pytest-cov

Now we can measure coverage of our source code while running the tests

$ pytest --cov=src ./tests

Pre-commit

What is it?

  • It lets us run quick checks every time we make a commit by using git pre-commit hooks
  • It makes it easy to define and install these little scripts by writing a short config file
  • It ensures that linting and formatting are run continuously on your codebase - no more CI failures!

Example file

Must be named .pre-commit-config.yaml

repos:
    # Where to find the hooks
-   repo: https://github.com/pre-commit/pre-commit-hooks
    # What version
    rev: v2.3.0
    # Which hooks to use in that repo
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace
-   repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
    -   id: black

Usage

Install the hooks

$ pre-commit install

All the checks will now run on every git commit on the changed files

We can also run on all our files - useful for CI

$ pre-commit run --all-files

Exercise

  • Check here for some out-of-the-box hooks
  • Check here for hooks for other tools
  • Add some hooks you think could be useful

Tox

What is tox

Tox is designed as a test-running framework, but it also works as a generic command runner

When we run tox

  • tox reads a tox.ini file
  • creates a virtualenv for each environment to be run
  • installs defined dependencies into the virtualenv, including your package
  • runs the defined command

Why use tox

  • For testing, tox runs the tests on the installed package, not your source code!
  • This mimics what your users will do and creates more reliable tests
  • Tox is also designed to test with multiple python versions

Tox can run arbitrary commands with separated dependencies.

  • Want to run sphinx? define an environment
  • Want to run black? define an environment
  • Want to run mypy? define an environment

The tox.ini file

This is where we define our environments


[tox]
envlist = what envs will run if I just write `tox` without args

[testenv]
# A special env - will be run if we specify a python version
# Example: `tox -e py38` will run `testenv` with python 3.8
# `py` will run this env with just `python`

[testenv:name_of_env]
# The rest of the envs have a given name
deps = List of things to pip install
commands = command to run
skip_install = whether or not to skip installing your package

Implementing a black environment

[tox]
envlist = black

[testenv:black]
skip_install = True
deps = black
commands = black .

Points of note

  • tox strips most environment variables when running - use passenv or setenv if you need to keep some
  • tox installs into a separate environment - you need to specify all dependencies
  • Run a specific environment with -e, e.g. tox -e black

Exercise

Let’s set up our tox file to run our tests and some utilities

  • Add the testing to tox
  • Add pre-commit
  • Add a linter, like flake8
  • BONUS POINTS: check the flake8 docs for adding some flake8 plugins
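
A possible starting point (a sketch - the environment names, Python version and dependencies are up to you):

[tox]
envlist = py38, flake8, pre-commit

[testenv]
deps =
    pytest
    pytest-cov
commands = pytest --cov=src {posargs:tests}

[testenv:flake8]
skip_install = True
deps = flake8
commands = flake8 src tests

[testenv:pre-commit]
skip_install = True
deps = pre-commit
commands = pre-commit run --all-files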

Configuration

Since tox is a very common framework, many tools will look for configuration in the tox.ini

For details, check the tooling's documentation (e.g. pytest)

This lets us have one file with all our configuration in it, instead of one per tool

Let’s look at some common recommendations for configuration

Pytest

  • Automatically look in the tests folder for tests with testpaths
  • addopts adds command-line flags automatically to the pytest command (list here)
    • -ra means “generate short report of all tests that didn’t pass”
    • -l means “show local variables in the failed tests”
[pytest]
addopts = -ra -l
testpaths = tests
filterwarnings =
    once::Warning

Tox + Coverage

If you have implemented pytest + coverage in your tox file, you might have noticed that the test paths look a bit strange.

Tox runs your code in a virtual environment, so the path to the code is inside that virtualenv!

We configure coverage to treat tox paths the same as our src

[coverage:paths]
source =
    src
    .tox/*/site-packages

Flake8

  • Control flake8 ignores with ignore = <error code> - only if you really need to!
  • Set the max line length higher than 79 - black uses 88
  • Exclude directories like .tox or venv - otherwise flake8 will go through everything
[flake8]
max-line-length = 100
exclude =
    .git
    build
    dist
    notebooks
    __pycache__
    .tox
    .pytest_cache
    *.egg-info

Pipelines & Transformers

Writing a custom Transformer

Writing a custom transformer is not hard - we just need to follow the scikit-learn recipe!

The template

from sklearn.base import BaseEstimator, TransformerMixin


class MyCustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter_1=1, parameter_2=2):
        self.parameter_1 = parameter_1
        self.parameter_2 = parameter_2
  • Must inherit from BaseEstimator and TransformerMixin
  • The __init__ should only set parameters
  • It should not have any logic inside
  • All parameters must have default values
    def fit(self, x, y = None):
        validate_data(x)
        self.learned_parameter_ = my_func(x)
        return self
  • The “learning” of parameters from training data happens here
  • Any validation of data/parameters goes in here
  • Scikit-learn convention is that “learned” attributes get a trailing underscore
  • Most transformers work only with X, but some need y - default to None
  • Returns self for chaining
    def transform(self, x, y=None):
        # We often want to make sure we don't modify the original data
        x_ = x.copy()
        # Transform the copy using the learned parameters
        new_data = my_transform_func(x_, self.learned_parameter_)
        return new_data
  • Transform the data using the learned parameter
  • Make sure not to modify x directly
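
Putting the template together, a complete toy transformer might look like this - it centers each column by subtracting the means learned during fit:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Toy transformer that subtracts the column means learned during fit"""

    def fit(self, x, y=None):
        # Learn the column means from the training data
        self.means_ = x.mean()
        return self

    def transform(self, x, y=None):
        # Copy so we don't modify the original data
        x_ = x.copy()
        return x_ - self.means_

Because it follows the recipe, it can be dropped into a scikit-learn Pipeline and gets fit_transform for free from TransformerMixin.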

Adding a CLI (Command Line Interface)

Running our model from the command line

Training a model in Jupyter is useful, but we often want to be able to train a model from the command line

Then we can automate the retraining pretty easily

We need a CLI!

Click

Click is a great package for adding a CLI to our model - see the docs

import click

@click.command()
@click.option("--cv", default=10, type=int, help="Number of cross-validations to do")
def train(cv: int):
    click.secho(f"Training with {cv} splits...", fg="green")
    result = model.score_estimator(dataset, cv=cv)
    click.echo(result)

if __name__ == "__main__":
    train()

Now we can call it from the command line - assuming it’s in a file called main.py

$ python main.py --cv 5
Training with 5 splits...
<Result metrics={"accuracy": .98}>

We can create a “proper” CLI by adding a line to pyproject.toml

[tool.poetry.scripts]
my_cli = 'my_model.main:train'

After running poetry install we can now do

$ my_cli --cv 5
Training with 5 splits...
<Result metrics={"accuracy": .98}>

Multiple commands

If we want to have subcommands - e.g. train, publish, test etc. - we need a click.group

import click

@click.group()
def cli():
    pass

@cli.command()
def train():
    # train code
    ...

@cli.command()
def publish():
    # Publish code
    ...

if __name__ == "__main__":
    cli()

Now we can do something like this.

$ my_cli train --cv 5

Type Hinting

Static Typing

  • Other languages have a concept of static typing
  • When declaring a variable, we must declare a type as well and that type does not change
// This is Scala code
var s: String = "abc"
// This will error
s = 1

Dynamic Typing

  • Python is dynamically typed
  • We can reassign variables as we want
a = "abc"
a = 1

Advantages of Static typing

  • Static typing makes it easy to verify the correctness of our program
a = "abc"
a = 1
print(a.upper())

In Python, these errors only show up as exceptions at runtime - a static type checker would have caught the reassignment before the code ever ran

Advantages of Dynamic typing

  • Part of the simplicity of Python
  • We might not know up-front what types we can work with
  • “Duck typing”

The best of both worlds?

  • Python 3.5 introduced the typing module, allowing us to annotate our types
  • Annotated types have no effect at runtime - they help tools verify your code
  • Pycharm uses type hints to warn you if your types are incompatible
  • mypy is a CLI tool that verifies whether there are any typing issues
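
mypy is just another command-line tool - installing and running it on our source folder could look like this:

$ poetry add --dev mypy
$ mypy src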

How to type

a: str = "abc"
b: int = 1

def my_func(a: str, b: int) -> str:
    ...

The typing module

from typing import List, Tuple

def process_ints(list_of_ints: List[int]) -> Tuple[int, int]:
    ...

Classes are types too


class A:
    ...


class B:
    ...


def process_a_and_b(a: A, b: B) -> int:
    ...

Multiple Types

We can use Union to say “either or”

from typing import Union

def handle_data(c: Union[pd.DataFrame, pd.Series]):
    ...

Supporting Ducktyping

Duck typing can be expressed with a Protocol (available in the typing module since Python 3.8)

from typing import List, Protocol


class GreetType(Protocol):
    def greet(self) -> None:
        ...  # This is a literal ellipsis -> "..."


class Greeter:
    def greet(self) -> None:
        print("hello")


# Accepts any type that has a `greet` method
def greet_everyone(to_greet: List[GreetType]) -> None:
    ...

There are a number of built-in Protocols - see them here

Implement Types

  • Typing helps you document your code
  • Typing helps catch errors before they blow up
  • Typing makes Pycharm smarter