HaskellDB Performance
Things did not go well with HaskellDB when I ran my query across a large data set. MySQL as a query optimizer, sucked. Postgres seems to be doing better.
I started with a mysql database containing 11 years of day data from the ASX. This includes the closing price (and some other info) of all stocks traded on the Australian Stock Exchange (ASX) since 1997.
The tables are organized with unique keys and, based on some testing, include appropriate indexes.Things still ran slowly.
I tried tuning mysql. Then, based on the slow queries logged, tried running them by hand to see if I was missing any indexes. No such luck. The query just didn’t perform.
I simplified the query and ran it again (see below). Now it ran fast, but doesn’t help with the generated query. Looks like MySQL’s optimizer doesn’t optimize enough.
MySQL timing:
- Query 1: 6.25s (anywhere from 24s - 6s)
- Query 2: 0.15s
Next step. Install Postgres.
Postgres timing:
- Query 1: 0.03s
- Query 2: 0.03s
To avoid the cache impacting the results, both servers were stopped and re-started between each query run. With caching, Postgres gets down to 0.2s. MySQL will still have 6s/0.3s. Tests were run on a MacBook with 2Gb of memory.
Both queries were run with the command line tool of the respective database engines, mysql and psql. The end_of_day table has 4,749,025 records and the stock table has 25,863 records.
Query 1 – HaskellDB generated
SELECT closing_price2 as closing_price,
trade_date2 as trade_date
FROM (SELECT stock_id as stock_id2,
trade_date as trade_date2,
closing_price as closing_price2
FROM end_of_day as T1) as T1,
(SELECT stock_id as stock_id1,
ticker as ticker1
FROM stock as T1) as T2
WHERE stock_id1 = stock_id2
AND ticker1 IN ('FXJ','NWS','WAN')
AND trade_date2 = '2008-02-14 00:00:00';
Query 2 – Hand written
SELECT e.closing_price as closing_price,
e.trade_date as trade_date
FROM
stock s, end_of_day e
WHERE
s.stock_id = e.stock_id
AND s.ticker IN ('FXJ','NWS','WAN')
AND e.trade_date = '2008-02-14 00:00:00';
So don’t use HaskellDB and MySQL if you are doing joins across large tables. Use Postgres instead.
The Haskell query:
indexStock db tkrs stockDate = do
let q = do
s < - table S.stock
e <- table E.end_of_day
restrict (s!S.stock_id .==. e!E.stock_id .&&.
s!S.ticker `_in` tickers .&&.
e!E.trade_date .==. constant stockDate)
r <- project (closing_price << e!E.closing_price #
trade_date << e!E.trade_date)
return r
where
tickers = map (constant) tkrs
rs <- query db q
return $ sum $ map (\r -> read (r!closing_price)) rs

May 19th, 2008 at 6:23 am
That query generated by HaskellDB is absolutely ridiculous. It should probably be considered a bug in their joining algorithm.
I’m building a relational algebra library in Ruby and it will not produce such retarded queries:
stocks
.join(end_of_days)
.on(stocks[:stock_id].eq(end_of_days[:stock_id])
.where(
stocks[:ticker].in([‘FXJ’,’NWS’,’WAN’]),
end_of_days[:trade_date].eq(Time.now)
)
.project(end_of_days[:closing_price], end_of_days[:trade_date])
This will produce something practically equivalent to your query (the only salient difference being that the join condition will be in an ON clause).
This, of all things, was trivial to implement. Much harder is joining the same table to itself multiple times and performing joins in the presence of aggregations (which necessitates subselects). I suppose they are doing subselects as a naive way to solve all of these problems but you can write your query generator in such a way as to only produce subselects when it’s actually necessary (i.e., there is no other way to express the dataset).
May 19th, 2008 at 6:42 am
As I understand it, HaskellDB used to do lots of optimisations, but they mostly turned out to be incorrect in certain cases so were disabled. I guess it’s “just” a matter of figuring out the right side conditions and reenabling them.
May 19th, 2008 at 8:49 am
Nick – agreed that the query isn’t the best but it does have the advantage of being correct. We developed a ORM layer at work, and many of the issues are around the generated code being correct. I can live with performance issues provided I’m getting the right results.
The bit that annoyed me was that MySQL wouldn’t allow me to tune around the query. For a declarative language such as SQL, you shouldn’t have to change it to match the underlying implementation. But then that is the ideal.
Ganesh – that would make sense. My understanding is that HaskellDB was re-written to ensure it worked in a larger number of circumstances. Optimization would be the next thing to look at. I think that the test suite included with it should be beefed up before attempting that though.
May 19th, 2008 at 10:59 am
I think that although Nick has a valid point, things are not at all simple when considering the complexity of the back-end application generating such SQL queries. Most of the times, such applications are far too generic in nature, and it generating correct SQL while working across a domain is prioritized over generating correct SQL per case.
Also, being a declarative language, I have always fancied of being someone who’d rather not worry about DB level optimizations, and rather concentrates on what I am actually supposed to do.
Then, if Postgres takes care of something, where MySQL blows it up, is more of an advantage that Postgres provides…
I’d look at it that way.
May 19th, 2008 at 2:55 pm
Robins – This is part of why I’m looking into the performance aspect. I want to get comfortable enough with how it works for some samples so I don’t have to think about it. I can just write nice declarative queries in Haskell and let the backend worry about how it delivers the results.