As algorithmic traders, we need a lot of data to test and optimize our strategy ideas. Over time, the amount of data adds up and the search for a reliable, efficient and easy-to-use storage solution begins.

When I first confronted this issue, my immediate go-to solution was to use a database like MySQL or PostgreSQL. But as a Python developer and a heavy Pandas user, I was hoping for a more "Pythonic" solution. I was looking for a solution that will allow me to store Pandas dataframes fast and fetch them even faster -- all while keeping the data itself as *portable* as possible.

The closest solution I found was the very impressive [Arctic library](https://github.com/manahl/arctic) by [Man AHL](http://www.ahl.com/), but it required me to run a MongoDB server, and the data wasn't very easily portable.

I finally ended up with a solution that may raise some eyebrows, but I feel that it works far better in the context of small to mid-sized trading operation.

The solution I chose was ***not using a database at all***. 😲

You see, databases are designed to do a wide variety of things with data, many of which require checks and safety mechanisms to be implemented to do these things effectively and safely. This actually hurt the performance of what you are trying to do.

On the other hand, market timeseries data is unique, both in how it is used and how it is stored. The user **drobertson** explained it best in this [Stack Overflow comment](https://quant.stackexchange.com/questions/29572/building-financial-data-time-series-database-from-scratch):

> Market time series data is stored in a completely different way. In fact, I would say it is prepared rather than stored. Each data item only needs to be written once and after that never needs to be modified or changed. Data items can be written sequentially, there is no need to insert anything in the middle. It needs no ACID functionality at all. They have little to no references out to any other data. The time series is effectively its own thing.

So... the solution was to use a flat-file database, using a binary file format where I can store Pandas dataframes, and that supports compression to save on disk space and make portability easier. Luckily, the [Parquet file format](http://parquet.apache.org/) seemed to fit the bill just right :)

The next thing was to write a tool that will allow me to read and write such files in a "pythonic" way.

### Enters PyStore...

I created [PyStore](https://github.com/ranaroussi/pystore) to help me store and retrieve Pandas dataframes, and while it can store any Pandas object, it was designed with storing timeseries data in mind.

It's built on top of [Pandas](http://pandas.pydata.org/), [Numpy](http://numpy.pydata.org/) and [Dask](http://dask.pydata.org/) and stores the data in the [Parquet file format](http://parquet.apache.org/) (via [Fastparquet](https://github.com/dask/fastparquet)) in a hierarchical directory structure. Files are compressed using [Snappy](http://google.github.io/snappy/), a fast and efficient compression/decompression library from Google.

The end result?

A powerful, pythonic datastore for Pandas dataframes that can **easily query millions of rows in sub-second speed**.

PyStore provides namespaced collections of data. These collections allow bucketing data by source, user or some other metric (for example frequency: End-Of-Day; Minute Bars; etc.). Each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g. symbol).

---

## Getting started with PyStore

Let's get started with PyStore by first installing it using `pip`:

```bash
$ pip install PyStore
```

**INSTALLATION NOTE:** If installation fails, you probably need to install Snappy first (compression/decompression library). You can install Snappy C library with following commands:

* APT: ``sudo apt-get install libsnappy-dev``
* RPM: ``sudo yum install libsnappy-devel``
* Brew: ``brew install snappy``

\* Windows users should checkout [Snappy for Windows](https://snappy.machinezoo.com) and [this Stackoverflow post](https://stackoverflow.com/a/43756412/1783569) for help on installing Snappy and python-snappy.

---

Next, let's get some market data to work with. We'll use Quandl's API to download 37+ years worth of historical data for Apple's stock.

```python
import quandl

aapl = quandl.get("WIKI/AAPL", authtoken="your token here")
aapl.head()
```

Output should look something like this:

```text
             Open   High    Low  ...  Adj. Close  Adj. Volume
Date
1980-12-12  28.75  28.87  28.75  ...    0.422706    117258400
1980-12-15  27.38  27.38  27.25  ...    0.400652     43971200
1980-12-16  25.37  25.37  25.25  ...    0.371246     26432000
1980-12-17  25.87  26.00  25.87  ...    0.380362     21610400
1980-12-18  26.63  26.75  26.63  ...    0.391536     18362400
```

Next, let's explore the `PyStore` library...
We'll start by importing it to our code.

```python
import pystore
```

Next, we'll tell **PyStore** where to save our datastore.
This step is optional, and unless specified, the default path is set to `~/.pystore`).

```python
# Set storage path
pystore.set_path('/usr/share/pystore')
```

We can get a list of datastores found this location. Since we're just getting started, all we'll get is an empty list.

```python
# List stores
pystore.list_stores()
# returns: []
```

Now it's time to create our first datastore.

```python
# Connect to datastore (create it if not exist)
store = pystore.store('mydatastore')
```

Now when we call ``pystore.list_stores()`` we'll get a list with our new datastore listed.

```python
pystore.list_stores()
# returns: ['mydatastore']
```

Before we can save our APPL timeseries data, we need ao create a **Collection**.
As mentioned earlier, each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g. symbol).

```python
# Access a collection (create it if not exist)
collection = store.collection('NASDAQ.EOD')

# List all collections in the datastore
store.list_collections()

# returns ['NASDAQ.EOD']
```

Now we're ready to store our data. For demo purposes, we won't be storing the last row, which will be appended later. We'll also attach some metadata indicateing the data source.

```python
# Store the data in the collection under "AAPL"
%time collection.write('AAPL', aapl[:-1], metadata={'source': 'Quandl'})
```

On my laptop (2015 Macbook Pro), saving 37+ worth of historical data took less than 16 milliseconds. Not too shabby.

```text
CPU times: user 14.5 ms, sys: 2.75 ms, total: 17.2 ms
Wall time: 15.8 ms
```

To see all available items (symbols) in the collection, we'll use:

```python
# List all items in the collection
collection.list_items()

# returns: ['AAPL']
```

Now let's see how long does it take to read the data:

```python
# Reading the item's data
%time item = collection.item('AAPL')

```

```text
CPU times: user 4.38 ms, sys: 1.11 ms, total: 5.49 ms
Wall time: 4.5 ms
```

**4.5 milliseconds**. NICE!

Let's take a closer look at the returned ``item`` object, specifically
the ``data`` property, which returns a [Dask dataframe](http://dask.pydata.org/en/latest/dataframe.html) and the ``metadata`` property, which returns, well, the metadata.

```python
# Load AAPL Dask dataframe (see dask.pydata.org)
item.data
```

(**TL;DR;** A Dask DataFrame is a large parallel dataframe composed of many smaller Pandas dataframes, split along the index. These pandas dataframes may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask dataframe operation triggers many operations on the constituent Pandas dataframes.)

```python
# Load AAPL metadata
item.metadata
```

```text
{'source': 'Quandl', '_updated': '2018-06-05 16:04:56.203825'}
```

To get the entire data as a Pandas dataframe, we call:

```python
# load data as Pandas dataframe
df = item.to_pandas()
df.tail()
```

We should get something like this:

```text
              Open    High     Low  ...  Adj. Close  Adj. Volume
Date
2018-03-20  175.24  176.80  174.94  ...     175.240     19314039
2018-03-21  175.04  175.09  171.26  ...     171.270     35247358
2018-03-22  170.00  172.68  168.60  ...     168.845     41051076
2018-03-23  168.39  169.92  164.94  ...     164.940     40248954
2018-03-26  168.07  173.10  166.44  ...     172.770     36272617
```

To check how fast we can read the data as a Pandas dataframe, we can run this code:

```python
%time collection.item('AAPL').to_pandas()
```

```text
CPU times: user 7.28 ms, sys: 1.26 ms, total: 8.54 ms
Wall time: 7.57 ms
```

It took **7.57 milliseconds** to read the data *and* convert it back into Pandas dataframe format on a 2015 Macbook Pro. Not too bad at all...

Let's append the last day (row) to our item:

```python
# Append a row to AAPL
collection.append('AAPL', aapl[-1:])
collection.item('AAPL').to_pandas().tail()
```

We should see our new row added:

```text
              Open    High     Low  ...  Adj. Close  Adj. Volume
Date
2018-03-21  175.04  175.09  171.26  ...     171.270     35247358
2018-03-22  170.00  172.68  168.60  ...     168.845     41051076
2018-03-23  168.39  169.92  164.94  ...     164.940     40248954
2018-03-26  168.07  173.10  166.44  ...     172.770     36272617
2018-03-27  173.68  175.15  166.92  ...     168.340     38962839
```

That's the basic stuff which should allow you get started using PyStore in no time.

**But there's more...**

After a while, you'll have many items stored, and you may want to look some of them up by metadata. To do this, simple add your metadata key to the ``list_items`` method:

```python
# Query avaialable symbols based on metadata
collection.list_items(source='Quandl')

# returns: ['AAPL']
```

### Snapshot functionality

When working with data, there will be times when you'll accidentally mess up the data, making it unusable. For that reason, PyStore allows you to create *snapshots* - a point-in-time, *named* reference for all current items in a collection.

Creating a snapshot is done using the ``create_snapshot`` method:

```python
# Snapshot a collection
collection.create_snapshot('snapshot_name')

# List available snapshots
collection.list_snapshots()

# returns ['snapshot_name']
```

To see how snapshots work, let's change our original AAPL
to only include the `Close` and `Volume` columns.

```python
# Change the current dataset
collection.write('AAPL', aapl[['Close', 'Volume']],
                 metadata={'source': 'Quandl'},
                 overwrite=True)

# Load the "new" item
collection.item('AAPL').to_pandas().tail()
```

As we can see, we've lost all data except for the `Close` and `Volume` columns.

```text
              Close    Volume
Date
2018-03-21  171.270  35247358
2018-03-22  168.845  41051076
2018-03-23  164.940  40248954
2018-03-26  172.770  36272617
2018-03-27  168.340  38962839
```

However, we can load our snapshot data at any time using:

```python
# Get a version of a symbol given a snapshot name
snap_df = collection.item('AAPL', snapshot='snapshot_name')

snap_df.to_pandas().tail()
```

And, Voilà, we can see our data again.

```text
              Open    High     Low  ...  Adj. Close  Adj. Volume
Date
2018-03-21  175.04  175.09  171.26  ...     171.270     35247358
2018-03-22  170.00  172.68  168.60  ...     168.845     41051076
2018-03-23  168.39  169.92  164.94  ...     164.940     40248954
2018-03-26  168.07  173.10  166.44  ...     172.770     36272617
2018-03-27  173.68  175.15  166.92  ...     168.340     38962839
```

We can, of course, restore our data from the snapshot:

```python
collection.write('AAPL', snap_df.to_pandas(),
                 metadata={'source': 'Quandl'},
                 overwrite=True)
```

Lastly, we'll delete the snapshot:

```python
# Delete a collection snapshot
collection.delete_snapshot('snapshot_name')

# To delete all snapshots, use:
# collection.delete_snapshots()
```

To delete the collection and the datastore, run:

```python
# Delete the item from the current version
collection.delete_item('AAPL')

# Delete the collection
store.delete_collection('NASDAQ.EOD')
```

That concludes this quick tutorial of **PyStore**.
I hope you'll find it useful as I do.

---

## Speed test

In the above example, we've used EOD data (9,400 rows). As you recall, the write time was 15.8 ms and the read time (to Pandas) was 7.57 ms.

Let's try writing and reading 5 years worth of **minute level** data for the VIX Futures continuous contract (2,319,867 rows).

```python
# Test write speed
%time collection.write('VX', vixdf)
```

```text
CPU times: user 491 ms, sys: 148 ms, total: 639 ms
Wall time: 419 ms
```

Writing obviuosly takes longer here, as we're dealing with 246 times more data.
But, still, 419 ms is not bad.

Let's see how long it takes us to read that amount of data.

```python
# Test read speed
%time collection.item('VIX')
```

```text
CPU times: user 4.19 ms, sys: 2.24 ms, total: 6.42 ms
Wall time: 5.32 ms
```

It took me **5.32 milliseconds to reading 2,319,867 records** on my Laptop.
But the big question, is how long will it take to read the data *and* convert it to Pandas dataframe?

Let's see:

```python
# Test read speed
%time collection.item('VIX').to_pandas()
```

```text
CPU times: user 170 ms, sys: 79 ms, total: 249 ms
Wall time: 179 ms
```

So... it takes PyStore **179 milliseconds to read and parse 2,319,867 records** . Now this is Impressive!

---

Remember I mentioned portability earlier? Well, moving your datastore can't be easier. Simply copy your datastore directory (default is ``~/.pystore/<DATASTORE>``) to an external drive to move it to a new machine.

For peaking inside your Parquet files, I recommend [Chip](https://parquetviewer.com), a lightweight parquet viewer.

---

[PyStore](https://pypi.org/pypi/pystore/) is now available via ``pip`` or directly from [Github page](https://github.com/ranaroussi/pystore).

It currently works with local filesystem, and I plan on adding support for Amazon S3, Google Cloud Storage and Hadoop Distributed File System in the future.

Please let me know what you think!