Key issue in data design for SAP BW on HANA

Low on memory

Recently, I encountered a situation with an SAP BW on HANA system: free memory was gradually diminishing over time. Things were not critical yet, but it was serious enough to start thinking about contingency plans.

Diving into the matter, I discovered some valuable new insights into HANA database tables and primary keys. That really surprised me; I thought I had my knowledge of column-oriented databases more or less sorted (ha!).

In this blog I want to share the insights that 'changed my metrics' on HANA table design. A DSO designed to hold large data sets requires extra consideration with respect to its keys. That is: when size matters.

The original issue, by the way, was resolved by reconfiguring some database parameters (well spotted, basis consultants!) and by re-partitioning the largest tables. I'll leave that story aside. The takeaway for me, though, was the opportunity to perform the analysis and gain these insights.

Quick analysis

When you deal with this kind of sizing issue, my preferred way to survey table sizes is with SQL. That is, if you have the right statements at hand. For a quick analysis, here are some statements to execute.

The first action is to assess the total size of all HANA tables combined. Under transaction DB02 or DB50 / Diagnostics / SQL editor, issue the following command:
SELECT SUM("MEMORY_SIZE_IN_TOTAL") FROM "SYS"."M_CS_TABLES"
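The sizes in M_CS_TABLES are reported in bytes. If you prefer the figure in gigabytes right away, a small variation of the same statement (a sketch of my own) does the conversion:
SELECT ROUND(SUM("MEMORY_SIZE_IN_TOTAL") / 1024 / 1024 / 1024, 1) AS "TOTAL_GB" FROM "SYS"."M_CS_TABLES"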

What I got back was 2.2 TB (no comment!). Next, I want the top ten largest tables. Execute:

SELECT TOP 10 "MEMORY_SIZE_IN_TOTAL", "TABLE_NAME" FROM "SYS"."M_CS_TABLES" ORDER BY "MEMORY_SIZE_IN_TOTAL" DESC
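If you also want to see the schema, the record count and the size in GB in one go, a slightly extended variant will do (SCHEMA_NAME and RECORD_COUNT are, as far as I know, standard columns of this view):

SELECT TOP 10 "SCHEMA_NAME", "TABLE_NAME", "RECORD_COUNT", ROUND("MEMORY_SIZE_IN_TOTAL" / 1024 / 1024 / 1024, 1) AS "SIZE_GB" FROM "SYS"."M_CS_TABLES" ORDER BY "MEMORY_SIZE_IN_TOTAL" DESC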

With that, the issue was revealed: the active tables of four BW DSOs used 239, 217, 212, and 140 GB of memory, respectively. Together that was close to 40% of the total memory space.

The four tables, as it happens, were of similar structure and belonged to one and the same application. Quite a memory footprint for a single application, but then again this application is really important to the business, just so you know.

The real revelation, however, was yet to come, once I probed further into the table itself.

On the surface

Having found the culprit, it was time to inspect this largest DSO with RSA1, in BW modeling. These are the characteristics that I found:

  • Width of 70 fields
  • The primary key spans 12 fields, with a concatenated length of 210 characters (red flag!)
  • Contains 1.8 billion records (another red flag!)
  • Old-style ODSO, part of a semantically partitioned object

I was told that the design of this table was made several years back, on BW 7.0 or thereabouts, when databases were row-oriented and data was always stored on disk. I imagine that, back then, this design did provide a sensible trade-off between resource consumption and query performance.

But nowadays, with column-oriented and in-memory technology, this balance has changed. During the migration to HANA this aspect was not addressed, and the design of the DSO was left unchanged.

On top of that, the number of records is also an issue. Having this many records in one DSO (only 10% short of the 2-billion-record hard limit) is a bad idea for a number of reasons. Data becomes "unwieldy" at these numbers, and simply moving the data elsewhere, i.e. resolving the issue, is complicated in a productive system.
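By the way, M_CS_TABLES reports one row per table partition, so (if I read the view correctly) you can check how close each partition of a suspect table is to that limit; the table name is left blank here, as elsewhere in this blog:

SELECT "TABLE_NAME", "PART_ID", "RECORD_COUNT" FROM "SYS"."M_CS_TABLES" WHERE "TABLE_NAME" = '' ORDER BY "RECORD_COUNT" DESC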

The final aspect to consider is that this is an old-style ODSO, part of an SPO (semantically partitioned object). Migrating to a modern-style ADSO with more HANA-relevant features would mean a total redesign of the application, since ADSOs do not fit under an SPO. There is no budget for that.

Size per column

In a column-based table, each column is an index, and each column claims its own piece of memory. Let's switch back to SQL and inspect the memory size per column with this command:
SELECT * FROM "SYS"."M_CS_ALL_COLUMNS" WHERE "TABLE_NAME" = ''
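For the record, the variant I find handiest sorts and limits the output directly (again a sketch of my own; UNCOMPRESSED_SIZE is the same figure that shows up in the result below):

SELECT TOP 8 "COLUMN_NAME", "MEMORY_SIZE_IN_TOTAL", "UNCOMPRESSED_SIZE" FROM "SYS"."M_CS_ALL_COLUMNS" WHERE "TABLE_NAME" = '' ORDER BY "MEMORY_SIZE_IN_TOTAL" DESC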
These are the top 8 results after sorting by MEMORY_SIZE_IN_TOTAL:

COLUMN_NAME MEMORY_SIZE_IN_TOTAL UNCOMPRESSED_SIZE COMPRESSION_RATIO
$trexexternalkey$ 160.915.274.646 282.880.351.397 56,9%
$/BIC/ZPOSTALCD$COUNTRY$ 9.053.498.998 27.667.189.316 32,7%
/BIC/ZAAAAAAA 8.990.776.431 14.755.837.289 60,9%
$rowid$ 8.024.151.788 14.755.836.896 54,4%
/BIC/ZBBBBBBB 6.379.452.119 12.013.097.222 53,1%
/BIC/ZPOSTALCD 4.382.491.079 14.755.837.289 29,7%
/BIC/ZCCCCCCC 3.458.533.685 12.051.831.278 28,7%
/BIC/ZDDDDDDD 1.409.622.394 18.655.065.601 7,6%

Note: I obscured some InfoObject names.

The top entry, $trexexternalkey$, is the primary key, which takes up 75% of the total memory of this table! I had already anticipated something like this, but it is really problematic, because the 12-field concatenated key has no function in querying. This index is simply not used. This is bad design with catastrophic consequences for memory consumption in HANA. More on this later.

The runner-up is $/BIC/ZPOSTALCD$COUNTRY$, a postal code compounded with the country: what is called a 'compounded InfoObject' in BW. A few lines below is the entry /BIC/ZPOSTALCD, the same object without the compounding field. What stands out is that the compounded version is more than twice as large. More on that later, too.

What you also see is a column named $rowid$. This is a standard component of a column-based table: an internal index that is used to build up the table rows. It is normal for this "column" to be among the top entries in memory size.

Lesson 1: avoid concatenated keys when possible

A concatenated key in a HANA database produces a relatively large memory footprint. Sometimes this is unavoidable, as with a document line / document number pair, but otherwise it should be avoided as much as possible.

This applies especially in the case above, where the DSO has 12 key fields. The underlying design issue is that the content of the DSO happens to be process logs from a production machine on the shop floor, i.e. not transactions with an identifying key. In this situation, appointing 12 columns as the key is not a suitable solution in HANA.
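Incidentally, you can quickly check what the concatenated key costs for any suspect table by looking at the $trexexternalkey$ entry on its own (same view as before, table name again left blank):

SELECT "COLUMN_NAME", "MEMORY_SIZE_IN_TOTAL" FROM "SYS"."M_CS_ALL_COLUMNS" WHERE "TABLE_NAME" = '' AND "COLUMN_NAME" = '$trexexternalkey$'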

There are various ways to deal with that kind of logging data. Depending on the requirements, consider these two options:

  • Design an ADSO with "all characteristics are key". Every load is then additive, so if receiving the same data twice is a risk, you need to take care of that yourself.
  • Create an ADSO with a surrogate key in one new field: a hash of 32 characters, derived from the identifying (possibly all) fields. The hash needs to be developed in ABAP by an expert, as part of the transformation. You can then apply a load-to-overwrite scenario, so that receiving the same data twice will simply overwrite each individual record. This works fine under the premise that a hash collision (the unlikely coincidence that two distinct records map onto the same hash value) is insignificant and does not degrade the overall quality of the data. See the sketch just below this list for an illustration of the idea.
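To illustrate the idea behind the hash key (the real implementation belongs in the ABAP transformation, as said in the second option): here is a minimal sketch in HANA SQL, assuming the built-in HASH_SHA256 function is available on your revision. The field and table names are hypothetical, and SHA-256 returns 32 bytes rather than the 32 characters of, say, an MD5 hex string.

-- Sketch only, not the author's ABAP routine.
-- ZFIELD1..ZFIELD3 and MY_SOURCE_TABLE are hypothetical names standing in for the identifying fields.
-- The '|' separator avoids two different field combinations concatenating to the same string.
SELECT HASH_SHA256(TO_BINARY("ZFIELD1" || '|' || "ZFIELD2" || '|' || "ZFIELD3")) AS "SURROGATE_KEY",
       "ZFIELD1", "ZFIELD2", "ZFIELD3"
FROM "MY_SOURCE_TABLE"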

Lesson 2: low-cardinality key first

Another remarkable fact is that $/BIC/ZPOSTALCD$COUNTRY$ takes up twice the memory of /BIC/ZPOSTALCD. The postal code, in this case, is a 7-character code. How can adding two positions for country lead to double the size of the index?

I found out that this has to do with data compression. Data compression is a vital component of an in-memory database, and there are many ways and standards to implement it. I expect that HANA has a variety of compression algorithms and that the best-suited one is chosen by the database engine. Check out Wikipedia if you want to know more: https://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression.
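As an aside, you can actually see which compression type HANA picked for each column; if I read the system views correctly, M_CS_COLUMNS exposes this, together with the number of distinct values per column:

SELECT "COLUMN_NAME", "COMPRESSION_TYPE", "DISTINCT_COUNT", "MEMORY_SIZE_IN_TOTAL" FROM "SYS"."M_CS_COLUMNS" WHERE "TABLE_NAME" = '' ORDER BY "MEMORY_SIZE_IN_TOTAL" DESC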

The point is that, for composed keys, compression algorithms are likely to work more efficiently when the low-cardinality columns come first in the sort order. For instance: country code first, followed by postal code. Or line number first, followed by document number. For very large tables, it makes sense to test this and find out what works best in your situation.

That brings me back to $/BIC/ZPOSTALCD$COUNTRY$. Looking at the column name: is the order perhaps postal code first, followed by country code? That would explain the doubled size of the composed key, so I'm inclined to think that this is indeed the case.

Conclusion

When you design tables in BW that are intended to hold large volumes of data, it is worthwhile to consider sizing and memory consumption in the design phase. The use of composed keys, or a composed primary key, comes with a penalty in memory resources. If you can avoid these pitfalls, your application is likely to be more efficient and thus more cost-effective.

The only issue left on your shoulders is how to make your application, with all its "efficiency under the hood", stand out among the others, and get hired for the next big data job!

Hey SAP-insider, if you’re reading this, would you maybe care to comment on this detail…? I’d be obliged…