Setting up an ML Project

The Project

The Scenario

  • We are a new Copenhagen-based startup, homehelper.dk
  • We are a fully managed AirBnB service for people who want to rent out their apartment on AirBnB
  • We offer cleaning services, pictures, key exchange and everything else a potential renter might need

The Product

We want to build a feature where a potential customer can answer a few questions about their apartment and get an indication of how much money they can expect to make (minus our fee, of course!)

The data

We are a startup, so we don’t have much data. Luckily, we found a website that scrapes AirBnB data!

http://insideairbnb.com/

Starting a new project

Install pipx

https://github.com/pipxproject/pipx
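
pipx installs command-line tools into isolated virtualenvs. A typical install, per the pipx README (the exact commands may vary by platform):

$ python3 -m pip install --user pipx
$ python3 -m pipx ensurepath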

Use pipx to install cookiecutter

$ pipx install cookiecutter

Cookiecutter Data Science

We are going to base our project on the Data Science cookiecutter

https://drivendata.github.io/cookiecutter-data-science/

$ cookiecutter https://github.com/drivendata/cookiecutter-data-science

project_name [project_name]: price_forecaster
repo_name [price_forecaster]:
author_name [Your name (or your organization/company/team)]: Anders Bogsnes
description [A short description of the project.]: Predicting Airbnb Prices
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: 1

The structure

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

A few changes

The rest is up to you to decide as we work through the project

Set up our src

$ cd price_forecaster/src
$ mkdir price_forecaster
# Ignore the error about moving price_forecaster into itself - everything else moves fine
$ mv * price_forecaster
# Get back to our project root
$ cd ..
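
The src folder should now look like this:

src
└── price_forecaster
    ├── __init__.py
    ├── data
    ├── features
    ├── models
    └── visualization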

Poetry + pyenv for package management

Install poetry

https://python-poetry.org/docs

Install pyenv

https://github.com/pyenv/pyenv-installer

Setup your environment

$ pyenv install 3.8.5

$ pyenv virtualenv 3.8.5 price_forecaster

$ pyenv local price_forecaster
$ poetry init
Package name [price_forecaster]:  
Version [0.1.0]:  
Description []:  Airbnb price forecaster
Author [Anders Bogsnes <andersbogsnes@gmail.com>, n to skip]:  
License []:  MIT
Compatible Python versions [^3.8]:  

Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no
Generated file

[tool.poetry]
name = "price_forecaster"
version = "0.1.0"
description = "Airbnb price forecaster"
authors = ["Anders Bogsnes <andersbogsnes@gmail.com>"]
license = "MIT"

[tool.poetry.dependencies]
python = "^3.8"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"


Do you confirm generation? (yes/no) [yes] 

Poetry

Poetry is a relatively new dependency manager that takes advantage of the new pyproject.toml file (PEP 518 & PEP 517)

Project deps vs Dev-deps

Poetry differentiates between dev-dependencies and runtime-dependencies

$ poetry add pandas

$ poetry add --dev pytest

Lockfile

  • Dependencies are split into pyproject.toml and poetry.lock
  • Your direct dependencies are saved in pyproject.toml
  • The exact state of your environment is saved in poetry.lock

Install

  • poetry install will install the dependencies declared in your pyproject.toml (pinned by poetry.lock, if present)
  • It also installs your package in editable mode (pip install -e .) automatically

Envs

  • Poetry will always install in a virtualenv
  • If you are already in one, it will install there
  • If you are not, it will automatically generate one and install there

Assignment

  • Set up your project with cookiecutter, poetry and git
  • Create a GitLab repo for your project

Create a configuration file

We want to be able to reference our data folders - so let’s have a config file to import those paths from

import pathlib

# parents[2] is the project root, assuming this file lives at
# src/price_forecaster/config.py
ROOT_DIR = pathlib.Path(__file__).parents[2]
DATA_DIR = ROOT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
INTERIM_DIR = DATA_DIR / "interim"
PROCESSED_DIR = DATA_DIR / "processed"
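
Elsewhere in the package we can then import the paths instead of hard-coding them - a small usage sketch (module path and file name assumed):

# e.g. in src/price_forecaster/data/make_dataset.py
import pandas as pd

from price_forecaster.config import RAW_DIR

listings = pd.read_csv(RAW_DIR / "listings.csv.gz")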

Start documenting

Sphinx

Sphinx is a documentation compiler - it takes your source files and compiles them into a target format - usually HTML

Exploring Sphinx

The cookiecutter comes with a basic sphinx setup in the docs folder.

  • Install sphinx as a dev-dependency
  • Navigate to the docs folder
  • Run make html (see the commands below)
  • Open ./_build/html/index.html
  • Compare the .rst files in the docs folder to the generated html
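
A minimal version of those steps, assuming poetry manages the environment:

$ poetry add --dev sphinx
$ cd docs
$ make html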

Change the theme

We can configure sphinx in the conf.py file in docs.

We can change the theme by finding the html_theme variable and setting it to the theme we want.
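
For example, switching to the Read the Docs theme (assuming the sphinx-rtd-theme package is added as a dev-dependency):

# docs/conf.py
html_theme = "sphinx_rtd_theme"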

Exercise: change the theme of your documentation and rebuild it

.rst syntax

Sphinx uses reStructuredText as its markup language.

Read the basics here:

https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html

Directives

Sphinx and .rst have many directives that let sphinx know you want to do something.

A directive looks like this:

.. my_directive::
    :arg1: 1
    :arg2: 2

    Content given to the directive

toctree

  • The first directive we will be looking at is toctree

  • toctree defines which .rst files are included in the build and should be present somewhere in index.rst

  • It has a number of options

  • Generally, we list the names of the files we want to include as content, as shown below
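
A typical toctree, assuming getting-started.rst and features.rst sit next to index.rst:

.. toctree::
    :maxdepth: 2

    getting-started
    features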

Exercise

  • Add a new .rst file features.rst
  • Describe a feature
  • Add features to the toctree
  • Rebuild the documentation
  • Read about the other directives

Linking

We can refer to other sections or definitions using the appropriate role

If we have a label somewhere

.. _my_label:

We can create a link to it

This refers to :ref:`my_label`

(note the backticks)

If we have documented a class, for example, we can link to it

:class:`Foo`

Extensions

Sphinx has a number of extensions we can use - some built-in and some we can install

These are configured in the conf.py file and there are a number of useful ones we should set up
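
Extensions are enabled by listing them in the extensions variable in conf.py - for example, the ones covered below:

# docs/conf.py
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
    "sphinx.ext.mathjax",
    "sphinx.ext.intersphinx",
    "sphinx.ext.doctest",
]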

sphinx.ext.autodoc

Extracts documentation from your code docstrings

Usage

This will import the documentation for the my_package.my_module module

.. automodule:: my_package.my_module
    :members:

sphinx.ext.napoleon

We recommend writing docstrings in the “NumPy” docstring style

https://numpydoc.readthedocs.io/en/latest/format.html

This extension lets sphinx parse this style
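
For illustration, a short NumPy-style docstring on a hypothetical function:

def predict_price(n_bedrooms: int, area: float) -> float:
    """Estimate the nightly price of a listing.

    Parameters
    ----------
    n_bedrooms : int
        Number of bedrooms in the apartment.
    area : float
        Size of the apartment in square metres.

    Returns
    -------
    float
        Estimated nightly price in DKK.
    """
    ...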

sphinx.ext.mathjax

Allows for writing LaTeX math equations and having them render properly
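
For example, a (hypothetical) pricing model written with the math directive:

.. math::

    \hat{y} = X\beta + \epsilon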

sphinx.ext.intersphinx

Lets us refer to other projects that use sphinx in our documentation

We need to add a section to conf.py mapping a name to each external project

intersphinx_mapping = {
    "pd": ("https://pandas.pydata.org/pandas-docs/stable/", None),
    "sklearn": ("https://scikit-learn.org/stable/", None),
}

Now I can write :class:`pandas.DataFrame` and sphinx will auto-link it to the pandas documentation

sphinx.ext.doctest

Tests the code snippets in your documentation. This adds some new directives, chief among them doctest

.. doctest::

    >>> print(2 + 2)
    4
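
With the standard Sphinx Makefile, the snippets can then be checked with the doctest builder:

$ make doctest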

The Data

Download the data

Before we can do any EDA, we need to download the data so we can explore it

We want the detailed listing data for Copenhagen - find it here: http://insideairbnb.com/get-the-data.html

Exercise

The data is updated monthly - write a function to download the data given a day, month and year

A Solution
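
A minimal sketch, assuming insideairbnb still publishes the Copenhagen listings under the URL pattern data.insideairbnb.com/denmark/hovedstaden/copenhagen/<YYYY-MM-DD>/data/listings.csv.gz (verify against the get-the-data page):

import datetime
import pathlib
import urllib.request

from price_forecaster.config import RAW_DIR

# Assumed URL pattern - check http://insideairbnb.com/get-the-data.html
URL_TEMPLATE = (
    "http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/"
    "{date:%Y-%m-%d}/data/listings.csv.gz"
)


def download_listings(year: int, month: int, day: int) -> pathlib.Path:
    """Download the detailed listings file for a given scrape date."""
    date = datetime.date(year, month, day)
    target = RAW_DIR / f"listings_{date:%Y-%m-%d}.csv.gz"
    urllib.request.urlretrieve(URL_TEMPLATE.format(date=date), target)
    return target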