We want to build a feature where a potential renter can answer a few questions about their apartment and get an indication of how much money they can expect to make (minus our fee, of course!)
We are a startup, so we don’t have much data. Luckily, we found a website that scrapes Airbnb data!
Use pipx to install cookiecutter
$ pipx install cookiecutter
We are going to base our project on the Data Science cookiecutter
$ cookiecutter https://github.com/drivendata/cookiecutter-data-science
project_name [project_name]: price_forecaster
repo_name [price_forecaster]:
author_name [Your name (or your organization/company/team)]: Anders Bogsnes
description [A short description of the project.]: Predicting Airbnb Prices
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: 1
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
We will deviate from this layout in two ways: we use a src/<package_name> layout (https://hynek.me/articles/testing-packaging/#src), and we add a tests directory to put our tests in.
The rest is up to you to decide as we work through the project
$ cd price_forecaster/src
$ mkdir price_forecaster
$ mv * price_forecaster
# Ignore the error about moving price_forecaster into itself - everything else was moved
# Get back to our project root
$ cd ..
$ pyenv install 3.8.5
$ pyenv virtualenv 3.8.5 price_forecaster
$ pyenv local price_forecaster
$ poetry init
Package name [price_forecaster]:
Version [0.1.0]:
Description []: Airbnb price forecaster
Author [Anders Bogsnes <andersbogsnes@gmail.com>, n to skip]:
License []: MIT
Compatible Python versions [^3.8]:
Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no
Generated file
[tool.poetry]
name = "price_forecaster"
version = "0.1.0"
description = "Airbnb price forecaster"
authors = ["Anders Bogsnes <andersbogsnes@gmail.com>"]
license = "MIT"

[tool.poetry.dependencies]
python = "^3.8"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"
Do you confirm generation? (yes/no) [yes]
Poetry is a relatively new dependency manager which takes advantage of the new pyproject.toml file (PEP 518 & PEP 517)
Poetry differentiates between dev-dependencies and runtime-dependencies
$ poetry add pandas
$ poetry add --dev pytest
poetry add updates both pyproject.toml and poetry.lock.
poetry install will install the dependencies from your pyproject.toml, and installs your own package in editable mode (-e .) automatically.
We want to be able to reference our data folders - so let’s have a config file to import those paths from
import pathlib
# parents[2] is the project root, assuming this file lives at src/price_forecaster/config.py
ROOT_DIR = pathlib.Path(__file__).parents[2]
DATA_DIR = ROOT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
INTERIM_DIR = DATA_DIR / "interim"
PROCESSED_DIR = DATA_DIR / "processed"
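Anywhere in the package we can then import these paths - a small usage sketch, assuming the config module lives at src/price_forecaster/config.py:

from price_forecaster.config import RAW_DIR

# Hypothetical file name - anything saved under data/raw can be located this way
listings_path = RAW_DIR / "listings.csv.gz"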
Sphinx is a documentation compiler - it takes your source files and documents and compiles them into a target format, usually HTML
The cookiecutter comes with a basic Sphinx setup in the docs folder.
Running make html compiles the .rst files in the docs folder to HTML - open ./_build/html/index.html to see the result.
We can configure Sphinx in the conf.py file in docs.
We can change the theme by finding the html_theme
variable and setting it to the theme we want.
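For example, to switch to the Read the Docs theme (an assumption - any installed theme works, e.g. after poetry add --dev sphinx-rtd-theme):

# docs/conf.py
html_theme = "sphinx_rtd_theme"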
Sphinx uses reStructuredText as its markup language.
Read the basics here:
https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html
Sphinx and .rst have many directives that tell Sphinx you want to do something.
A directive looks like this:
.. my_directive::
   :arg1: 1
   :arg2: 2

   Content given to the directive
The first directive we will look at is toctree.
toctree defines what .rst files are included in the build and should be present somewhere in index.rst.
It has a number of options.
Generally, we list the names of the files we want to include as the directive's content - for example a features.rst file.
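A minimal sketch of a toctree in index.rst - entries are document names, so features.rst is listed without its extension, and :maxdepth: is one of the options mentioned above:

.. toctree::
   :maxdepth: 2

   features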
We can refer to other sections or definitions using roles.
If we have a label somewhere
.. _my_label:
We can create a link to it
This refers to :ref:`my_label`
(note the backticks)
If we have documented a class, for example, we can link to it
:class:`Foo`
Sphinx has a number of extensions we can use - some built-in and some we can install
These are configured in the conf.py file, and there are a number of useful ones we should set up
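Extensions are enabled by adding them to the extensions list in conf.py - a sketch listing the ones covered below (all of these ship with Sphinx):

# docs/conf.py
extensions = [
    "sphinx.ext.autodoc",      # pull documentation out of docstrings
    "sphinx.ext.napoleon",     # understand NumPy-style docstrings
    "sphinx.ext.mathjax",      # render LaTeX math in the HTML output
    "sphinx.ext.intersphinx",  # link to other projects' documentation
    "sphinx.ext.doctest",      # test the code snippets in the docs
]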
autodoc extracts documentation from your code's docstrings.
This will import the documentation for the my_package.my_module module:

.. automodule:: my_package.my_module
   :members:
We recommend writing docstrings in the “NumPy” docstyle:
https://numpydoc.readthedocs.io/en/latest/format.html
The napoleon extension lets Sphinx parse this style
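A short, made-up example of the NumPy style that napoleon and autodoc can render:

def nightly_revenue(price: float, occupancy_rate: float) -> float:
    """Estimate expected revenue per night.

    Parameters
    ----------
    price : float
        The listed price per night.
    occupancy_rate : float
        Expected fraction of nights booked, between 0 and 1.

    Returns
    -------
    float
        The expected revenue per night.
    """
    return price * occupancy_rate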
The math extensions (e.g. sphinx.ext.mathjax) allow writing LaTeX math equations and having them render properly
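For example, with the math directive (the equation itself is just an illustration):

.. math::

   \hat{y} = X\hat{\beta} + \epsilon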
intersphinx lets us refer to other projects that use Sphinx in our documentation.
We need to add a section to conf.py mapping a name to the external project:
intersphinx_mapping = {
"pd": ("https://pandas.pydata.org/pandas-docs/stable/", None),
"sklearn": ("https://scikit-learn.org/stable/", None),
}
Now I can refer to the pandas documentation with :class:`pandas.DataFrame`
and Sphinx will auto-link to the pandas documentation
The doctest extension tests the code snippets in your documentation. It adds some new directives, chief among them doctest:
.. doctest::

   >>> print(2 + 2)
   4
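Assuming the docs Makefile exposes the standard Sphinx targets and sphinx.ext.doctest is enabled in conf.py, the snippets can be checked with:

$ cd docs
$ make doctest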
Before we can do EDA - we should download the data and explore it
We want the detailed listing data for Copenhagen - find it here: http://insideairbnb.com/get-the-data.html
The data is updated monthly - write a function to download the data given a day, month and year
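A sketch of such a function using requests (add it with poetry add requests). The URL template below is an assumption based on the download links on the site - verify it against http://insideairbnb.com/get-the-data.html before relying on it:

import pathlib

import requests

from price_forecaster.config import RAW_DIR

# Assumed URL pattern for the Copenhagen "detailed listings" file - check the actual links on the site
URL_TEMPLATE = (
    "http://data.insideairbnb.com/denmark/hovedstaden/copenhagen/"
    "{year}-{month:02d}-{day:02d}/data/listings.csv.gz"
)


def download_listings(day: int, month: int, year: int, dest_dir: pathlib.Path = RAW_DIR) -> pathlib.Path:
    """Download the detailed listings for a given scrape date into data/raw."""
    url = URL_TEMPLATE.format(year=year, month=month, day=day)
    dest = dest_dir / f"listings_{year}-{month:02d}-{day:02d}.csv.gz"
    response = requests.get(url)
    response.raise_for_status()
    dest.write_bytes(response.content)
    return dest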