Fast Data Store for Pandas Time-Series Data using PyStore

As algorithmic traders, we need a lot of data to test and optimize our strategy ideas. Over time, the amount of data adds up and the search for a reliable, efficient and easy-to-use storage solution begins.

When I first confronted this issue, my immediate go-to solution was to use a database like MySQL or PostgreSQL. But as a Python developer and a heavy Pandas user, I was hoping for a more "Pythonic" solution. I was looking for a solution that will allow me to store Pandas dataframes fast and fetch them even faster -- all while keeping the data itself as portable as possible.

The closest solution I found was the very impressive Arctic library by Man AHL, but it required me to run a MongoDB server, and the data wasn't very easily portable.

I finally ended up with a solution that may raise some eyebrows, but I feel that it works far better in the context of small to mid-sized trading operation.

The solution I chose was not using a database at all. 😲

You see, databases are designed to do a wide variety of things with data, many of which require checks and safety mechanisms to be implemented to do these things effectively and safely. This actually hurt the performance of what you are trying to do.

On the other hand, market timeseries data is unique, both in how it is used and how it is stored. The user drobertson explained it best in this Stack Overflow comment:

Market time series data is stored in a completely different way. In fact, I would say it is prepared rather than stored. Each data item only needs to be written once and after that never needs to be modified or changed. Data items can be written sequentially, there is no need to insert anything in the middle. It needs no ACID functionality at all. They have little to no references out to any other data. The time series is effectively its own thing.

So... the solution was to use a flat-file database, using a binary file format where I can store Pandas dataframes, and that supports compression to save on disk space and make portability easier. Luckily, the Parquet file format seemed to fit the bill just right :)

The next thing was to write a tool that will allow me to read and write such files in a "pythonic" way.

Enters PyStore...

I created PyStore to help me store and retrieve Pandas dataframes, and while it can store any Pandas object, it was designed with storing timeseries data in mind.

It's built on top of Pandas, Numpy and Dask and stores the data in the Parquet file format (via Fastparquet) in a hierarchical directory structure. Files are compressed using Snappy, a fast and efficient compression/decompression library from Google.

The end result?

A powerful, pythonic datastore for Pandas dataframes that can easily query millions of rows in sub-second speed.

PyStore provides namespaced collections of data. These collections allow bucketing data by source, user or some other metric (for example frequency: End-Of-Day; Minute Bars; etc.). Each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g. symbol).

Getting started with PyStore

Let's get started with PyStore by first installing it using pip:

$ pip install PyStore

INSTALLATION NOTE: If installation fails, you probably need to install Snappy first (compression/decompression library). You can install Snappy C library with following commands:

APT: sudo apt-get install libsnappy-dev
RPM: sudo yum install libsnappy-devel
Brew: brew install snappy

* Windows users should checkout Snappy for Windows and this Stackoverflow post for help on installing Snappy and python-snappy.

Next, let's get some market data to work with. We'll use Quandl's API to download 37+ years worth of historical data for Apple's stock.

import quandl

aapl = quandl.get("WIKI/AAPL", authtoken="your token here")
aapl.head()

Output should look something like this:

             Open   High    Low  ...  Adj. Close  Adj. Volume
Date
1980-12-12  28.75  28.87  28.75  ...    0.422706    117258400
1980-12-15  27.38  27.38  27.25  ...    0.400652     43971200
1980-12-16  25.37  25.37  25.25  ...    0.371246     26432000
1980-12-17  25.87  26.00  25.87  ...    0.380362     21610400
1980-12-18  26.63  26.75  26.63  ...    0.391536     18362400

Next, let's explore the PyStore library... We'll start by importing it to our code.

import pystore

Next, we'll tell PyStore where to save our datastore. This step is optional, and unless specified, the default path is set to ~/.pystore).

# Set storage path
pystore.set_path('/usr/share/pystore')

We can get a list of datastores found this location. Since we're just getting started, all we'll get is an empty list.

# List stores
pystore.list_stores()
# returns: []

Now it's time to create our first datastore.

# Connect to datastore (create it if not exist)
store = pystore.store('mydatastore')

Now when we call pystore.list_stores() we'll get a list with our new datastore listed.

pystore.list_stores()
# returns: ['mydatastore']

Before we can save our APPL timeseries data, we need ao create a Collection. As mentioned earlier, each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g. symbol).

# Access a collection (create it if not exist)
collection = store.collection('NASDAQ.EOD')

# List all collections in the datastore
store.list_collections()

# returns ['NASDAQ.EOD']

Now we're ready to store our data. For demo purposes, we won't be storing the last row, which will be appended later. We'll also attach some metadata indicateing the data source.

# Store the data in the collection under "AAPL"
%time collection.write('AAPL', aapl[:-1], metadata={'source': 'Quandl'})

On my laptop (2015 Macbook Pro), saving 37+ worth of historical data took less than 16 milliseconds. Not too shabby.

CPU times: user 14.5 ms, sys: 2.75 ms, total: 17.2 ms
Wall time: 15.8 ms

To see all available items (symbols) in the collection, we'll use:

# List all items in the collection
collection.list_items()

# returns: ['AAPL']

Now let's see how long does it take to read the data:

# Reading the item's data
%time item = collection.item('AAPL')

CPU times: user 4.38 ms, sys: 1.11 ms, total: 5.49 ms
Wall time: 4.5 ms

4.5 milliseconds. NICE!

Let's take a closer look at the returned item object, specifically the data property, which returns a Dask dataframe and the metadata property, which returns, well, the metadata.

# Load AAPL Dask dataframe (see dask.pydata.org)
item.data

(TL;DR; A Dask DataFrame is a large parallel dataframe composed of many smaller Pandas dataframes, split along the index. These pandas dataframes may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask dataframe operation triggers many operations on the constituent Pandas dataframes.)

# Load AAPL metadata
item.metadata

{'source': 'Quandl', '_updated': '2018-06-05 16:04:56.203825'}

To get the entire data as a Pandas dataframe, we call:

# load data as Pandas dataframe
df = item.to_pandas()
df.tail()

We should get something like this:

              Open    High     Low  ...  Adj. Close  Adj. Volume
Date
2018-03-20  175.24  176.80  174.94  ...     175.240     19314039
2018-03-21  175.04  175.09  171.26  ...     171.270     35247358
2018-03-22  170.00  172.68  168.60  ...     168.845     41051076
2018-03-23  168.39  169.92  164.94  ...     164.940     40248954
2018-03-26  168.07  173.10  166.44  ...     172.770     36272617

To check how fast we can read the data as a Pandas dataframe, we can run this code:

%time collection.item('AAPL').to_pandas()

CPU times: user 7.28 ms, sys: 1.26 ms, total: 8.54 ms
Wall time: 7.57 ms

It took 7.57 milliseconds to read the data and convert it back into Pandas dataframe format on a 2015 Macbook Pro. Not too bad at all...

Let's append the last day (row) to our item:

# Append a row to AAPL
collection.append('AAPL', aapl[-1:])
collection.item('AAPL').to_pandas().tail()

We should see our new row added:

              Open    High     Low  ...  Adj. Close  Adj. Volume
Date
2018-03-21  175.04  175.09  171.26  ...     171.270     35247358
2018-03-22  170.00  172.68  168.60  ...     168.845     41051076
2018-03-23  168.39  169.92  164.94  ...     164.940     40248954
2018-03-26  168.07  173.10  166.44  ...     172.770     36272617
2018-03-27  173.68  175.15  166.92  ...     168.340     38962839

That's the basic stuff which should allow you get started using PyStore in no time.

But there's more...

After a while, you'll have many items stored, and you may want to look some of them up by metadata. To do this, simple add your metadata key to the list_items method:

# Query avaialable symbols based on metadata
collection.list_items(source='Quandl')

# returns: ['AAPL']

Snapshot functionality

When working with data, there will be times when you'll accidentally mess up the data, making it unusable. For that reason, PyStore allows you to create snapshots - a point-in-time, named reference for all current items in a collection.

Creating a snapshot is done using the create_snapshot method:

# Snapshot a collection
collection.create_snapshot('snapshot_name')

# List available snapshots
collection.list_snapshots()

# returns ['snapshot_name']

To see how snapshots work, let's change our original AAPL to only include the Close and Volume columns.

# Change the current dataset
collection.write('AAPL', aapl[['Close', 'Volume']],
                 metadata={'source': 'Quandl'},
                 overwrite=True)

# Load the "new" item
collection.item('AAPL').to_pandas().tail()

As we can see, we've lost all data except for the Close and Volume columns.

              Close    Volume
Date
2018-03-21  171.270  35247358
2018-03-22  168.845  41051076
2018-03-23  164.940  40248954
2018-03-26  172.770  36272617
2018-03-27  168.340  38962839

However, we can load our snapshot data at any time using:

# Get a version of a symbol given a snapshot name
snap_df = collection.item('AAPL', snapshot='snapshot_name')

snap_df.to_pandas().tail()

And, Voilà, we can see our data again.

              Open    High     Low  ...  Adj. Close  Adj. Volume
Date
2018-03-21  175.04  175.09  171.26  ...     171.270     35247358
2018-03-22  170.00  172.68  168.60  ...     168.845     41051076
2018-03-23  168.39  169.92  164.94  ...     164.940     40248954
2018-03-26  168.07  173.10  166.44  ...     172.770     36272617
2018-03-27  173.68  175.15  166.92  ...     168.340     38962839

We can, of course, restore our data from the snapshot:

collection.write('AAPL', snap_df.to_pandas(),
                 metadata={'source': 'Quandl'},
                 overwrite=True)

Lastly, we'll delete the snapshot:

# Delete a collection snapshot
collection.delete_snapshot('snapshot_name')

# To delete all snapshots, use:
# collection.delete_snapshots()

To delete the collection and the datastore, run:

# Delete the item from the current version
collection.delete_item('AAPL')

# Delete the collection
store.delete_collection('NASDAQ.EOD')

That concludes this quick tutorial of PyStore. I hope you'll find it useful as I do.

Speed test

In the above example, we've used EOD data (9,400 rows). As you recall, the write time was 15.8 ms and the read time (to Pandas) was 7.57 ms.

Let's try writing and reading 5 years worth of minute level data for the VIX Futures continuous contract (2,319,867 rows).

# Test write speed
%time collection.write('VX', vixdf)

CPU times: user 491 ms, sys: 148 ms, total: 639 ms
Wall time: 419 ms

Writing obviuosly takes longer here, as we're dealing with 246 times more data. But, still, 419 ms is not bad.

Let's see how long it takes us to read that amount of data.

# Test read speed
%time collection.item('VIX')

CPU times: user 4.19 ms, sys: 2.24 ms, total: 6.42 ms
Wall time: 5.32 ms

It took me 5.32 milliseconds to reading 2,319,867 records on my Laptop. But the big question, is how long will it take to read the data and convert it to Pandas dataframe?

Let's see:

# Test read speed
%time collection.item('VIX').to_pandas()

CPU times: user 170 ms, sys: 79 ms, total: 249 ms
Wall time: 179 ms

So... it takes PyStore 179 milliseconds to read and parse 2,319,867 records . Now this is Impressive!

Remember I mentioned portability earlier? Well, moving your datastore can't be easier. Simply copy your datastore directory (default is ~/.pystore/<DATASTORE>) to an external drive to move it to a new machine.

For peaking inside your Parquet files, I recommend Chip, a lightweight parquet viewer.

PyStore is now available via pip or directly from Github page.

It currently works with local filesystem, and I plan on adding support for Amazon S3, Google Cloud Storage and Hadoop Distributed File System in the future.

Please let me know what you think!

Updated on 5 June 2018. For the latest version and comments, please see:
https://aroussi.com/post/fast-datastore-for-pandas-time-series-data