Wednesday, October 16, 2013

Columbus Day on Packt Publishing!

Get a 50% discount on Packt Publishing books



Greetings,

Blag.
Developer Experience.

Thursday, October 3, 2013

SAP HANA - Let's talk about Rows and Columns

As we all know, SAP HANA can create tables in two ways: Row and Column. Row tables are supposed to be better for tables that are updated very often, while Column tables suit tables that are rarely updated and are supposed to be faster when it comes to reading their contents.

As I had never actually tested this, I created two Python scripts to generate 1 million records for two tables, DOC_HEADER and DOC_DETAILS. The two tables share the DOCUMENT_ID, YEAR and AREA fields, but since the information is created randomly, how many values the two tables actually have in common is not known in advance.
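The generator scripts themselves are not shown in the post; here is a minimal sketch of what such a script could look like. The field names come from the tables above, but the value ranges, the `AREAS` list, and the `generate_records` helper are my assumptions, not the original code:

```python
import random

AREAS = ['001', '002', '003', '004', '005']  # assumed set of AREA codes

def generate_records(n, seed=None):
    """Generate n random (DOCUMENT_ID, YEAR, AREA, AMOUNT) tuples."""
    rng = random.Random(seed)
    return [(rng.randint(1, 100000),               # DOCUMENT_ID
             rng.randint(2000, 2013),              # YEAR
             rng.choice(AREAS),                    # AREA
             round(rng.uniform(1.0, 1000.0), 2))   # AMOUNT
            for _ in range(n)]

# For the real test this would be n = 1000000, inserted via the HANA client
rows = generate_records(5, seed=42)
```

Because both tables draw keys independently from the same ranges, only a fraction of the generated key combinations end up matching between header and detail.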

For the sake of the example I created both tables, DOC_HEADER and DOC_DETAILS, using both the Row and Column types, and wrote another Python script to measure the reading time.
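The storage type is chosen per table at creation time with `CREATE ROW TABLE` versus `CREATE COLUMN TABLE`. The original DDL is not shown in the post; this is a sketch with assumed column definitions, using the table names that appear in the measurement script below:

```python
# Hedged sketch: the column list and types are assumptions, only the
# ROW/COLUMN keyword and the table names mirror the post.
ROW_DDL = '''CREATE ROW TABLE "BLAG"."DOC_HEADER_ROW" (
    "DOCUMENT_ID" INTEGER,
    "YEAR" INTEGER,
    "AREA" VARCHAR(10))'''

COLUMN_DDL = '''CREATE COLUMN TABLE "BLAG"."DOC_HEADER_COLUMN" (
    "DOCUMENT_ID" INTEGER,
    "YEAR" INTEGER,
    "AREA" VARCHAR(10))'''
```

The same pattern would apply to the two DOC_DETAIL variants, with an additional AMOUNT column for the aggregation.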

Column_Row_SAPHANA_Test.py
import dbapi  # SAP HANA Python client (hdbcli.dbapi in newer client versions)
import time

# Connection details masked; 30015 is the default SQL port for instance 00
conn = dbapi.connect('*****', 30015, 'SYSTEM', '****')
cur = conn.cursor()

query = '''SELECT H."DOCUMENT_ID", H."YEAR", H."AREA", SUM("AMOUNT") AS AMOUNT
           FROM "BLAG"."DOC_HEADER_ROW" AS H INNER JOIN "BLAG"."DOC_DETAIL_ROW" AS D
           ON H."DOCUMENT_ID" = D."DOCUMENT_ID"
           AND H."YEAR" = D."YEAR"
           AND H."AREA" = D."AREA"
           GROUP BY H."DOCUMENT_ID", H."YEAR", H."AREA"'''

start = time.time()  # wall-clock timing; time.clock() is deprecated in Python 3
cur.execute(query)
ret = cur.fetchall()
end = time.time()

counter = len(ret)  # number of aggregated rows returned
time_taken = end - start
print("Row Store - %s records in %s seconds" % (counter, time_taken))

query = '''SELECT H."DOCUMENT_ID", H."YEAR", H."AREA", SUM("AMOUNT") AS AMOUNT
           FROM "BLAG"."DOC_HEADER_COLUMN" AS H INNER JOIN "BLAG"."DOC_DETAIL_COLUMN" AS D
           ON H."DOCUMENT_ID" = D."DOCUMENT_ID"
           AND H."YEAR" = D."YEAR"
           AND H."AREA" = D."AREA"
           GROUP BY H."DOCUMENT_ID", H."YEAR", H."AREA"'''

start = time.time()
cur.execute(query)
ret = cur.fetchall()
end = time.time()

counter = len(ret)
time_taken = end - start
print("Column Store - %s records in %s seconds" % (counter, time_taken))

Both queries do exactly the same thing: an INNER JOIN between the two tables with the amounts summed per document, using the Row store tables in the first one and the Column store tables in the second one.
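To make the query's logic concrete, here is what it computes, sketched in pure Python: keep only detail rows whose (DOCUMENT_ID, YEAR, AREA) key also exists in the header table, and sum AMOUNT per key. The sample data here is made up for illustration:

```python
from collections import defaultdict

# Tiny stand-ins for the HANA tables; tuples are (DOCUMENT_ID, YEAR, AREA[, AMOUNT])
header = [(1, 2013, 'A'), (2, 2013, 'B')]
detail = [(1, 2013, 'A', 10.0), (1, 2013, 'A', 5.0), (3, 2012, 'C', 7.0)]

keys = set(header)
totals = defaultdict(float)
for doc_id, year, area, amount in detail:
    if (doc_id, year, area) in keys:            # INNER JOIN condition
        totals[(doc_id, year, area)] += amount  # SUM("AMOUNT") ... GROUP BY key

# totals == {(1, 2013, 'A'): 15.0}
```

Only matching keys survive the join, which is why the result set (7669 rows below) is much smaller than the 1 million generated records.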

Let's run this 3 times and see what happens ;)

Benchmark
Row Store - 7669 records in 2.45972177882 seconds
Column Store - 7669 records in 1.59409638594 seconds

Row Store - 7669 records in 2.32650395252 seconds
Column Store - 7669 records in 1.68382533783 seconds

Row Store - 7669 records in 2.44930342075 seconds
Column Store - 7669 records in 1.52396673192 seconds

As we can see, the Column store is consistently faster: each run takes roughly a third less time than the Row store (between 28% and 38% across the three runs).

Greetings,

Blag.
Developer Experience.