Monday, April 29, 2013

Large-Scale Processing in Netezza.


Transitioning from ETL to ELT

CIO: Why is that uber-powered [commodity RDBMS] system running out of steam? Didn’t we just upgrade?
MANAGER: Yes, but the upgrade didn’t take.
CIO: Didn’t take? Sounds like a doctor transplanting an organ. Do you mean the CPUs rejected it? (laughing)
MANAGER: (soberly) No, just the users. Still too slow.
CIO: That hardware plant cost us [X] million dollars and it had better get it done or I’ll dismantle it for parts. I might dismantle your prima-donna architects with it!

MANAGER: I can’t explain it. All the white papers said… (drifts into commodity mumbo-jumbo)
CIO: I took the brochure-tour too. All of your people wore T-shirts with the vendor’s logo last week. I see they’re back to plain polo shirts. Did something happen?
MANAGER: The expected numbers aren’t there. We’re tuning.
CIO: Tuning? They’re off by orders of magnitude. You’re not tuning your way out of this.
MANAGER: (tuning out) Yes, I know. (shuffles away mumbling)
Don’t reduce managers to mush. It’s just not fair. Why saddle them with the wrong hardware for the job and then expect it to produce what it cannot, because it was never designed to?
Outside of a database, “bulk data processing” means one thing: extract, transform, load (ETL), with dozens of tools to choose from. Each choice sets off a domino effect of necessary infrastructure: an ETL machine, plus people who know the tool and how to apply it. Scalability is not just about hardware, but also logistics; the goal is to produce more functionality without an army of developers to support it. But what if tool jockeys aren’t required? What if a little UNIX knowledge, plus a good handle on both SQL and our own information, is the core skill set for a large-scale, very successful data processing environment?
Bulk data processing inside a database machine comes in a single flavor: an insert/select statement using SQL, usually between intermediate database tables and/or persistent database tables (aka ELT). Setting aside the SQL-generation mechanics needed to support this (no handcrafted SQL statements for business logic!), let’s look under the hardware covers to determine whether this approach is viable for larger scales of data.
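Before we go under the covers, here is a minimal sketch of what that single flavor looks like (the table and column names are hypothetical, chosen only for illustration). The “T” of ELT rides along inside the select phrase, where joins and functions do the work an ETL tool would otherwise do on its own server:

insert into SALES_FACT (SALE_ID, CUST_ID, SALE_DATE, SALE_AMT_USD)
select s.SaleId, c.CUST_ID, cast(s.SaleDate as date), s.Amount * x.USD_RATE
from STG_SALES s
join CUSTOMER_DIM c on c.SRC_CUST_CODE = s.CustCode
join FX_RATE x on x.CCY = s.Currency and x.RATE_DATE = cast(s.SaleDate as date);

The extract lands in a staging table, and everything after that is just SQL running where the data already lives.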
General-purpose platforms are “horizontal,” which means that the CPUs are physically separated from disk drives, connected by a backplane, and shared-everything at the hardware level. Copying means pulling large quantities of data from the disks and over the backplane, then through the CPUs as well as through software with logical representations of the tables, and finally back down the backplane and back onto the disks. With this flow, the data meets itself coming and going on the (saturated) backplane. This necessitates big, fat, monolithic queries because we have to deal with the data while it’s in our hands, so to speak. This model, Multiple Instruction Multiple Data (MIMD), simply does not scale for bulk processing.
An IBM® Netezza® platform, by contrast, is a purpose-built, proprietary assembly of carefully optimized commodity parts in which each CPU is mated to its own disk drive (shared-nothing at the hardware level). When we create a table, it exists once logically, but its contents are physically spread across the CPU/disk pairs: 10 such pairs would store a 1,000-record table in 10 locations with 100 records each. A single query addresses all the CPUs at once, which means the time required to move 10 groups of 100 records in parallel is one-tenth of the time required to move the entire 1,000 as a single block. This model is known as Single Instruction Multiple Data (SIMD). It scales radically for bulk processing, and it’s why a TwinFin® with 90 or more CPUs can process billions of records in a fraction of the time it takes a similarly powered general-purpose competitor.
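To see how that physical spread is declared, here is a minimal sketch (the EMPLOYEE table and its distribution column are assumptions for illustration). The distribute on clause tells Netezza which column’s hash decides the CPU/disk pair on which each record lands:

create table EMPLOYEE
(
    EMP_ID          integer,
    EMP_FNAME       varchar(50),
    EMP_LNAME       varchar(50),
    EMP_START_DATE  date
)
distribute on (EMP_ID);

Distributing on a well-chosen column (or on random) keeps the slices evenly loaded, so no single CPU/disk pair becomes the bottleneck for the whole query.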

Transforming more than just data

Many who teeter on the decision to move logic from their ETL machine into Netezza get hung up on a simple question: Can we scale the approach? In short, the SQL insert/select statement has some logistics we must harness before the approach rests on a solid foundation. Consider this:
Insert into EMPLOYEE (EMP_ID, EMP_FNAME, EMP_LNAME, EMP_START_DATE) select ID, FirstName, LastName, St_Date from OldEmployees;
Seems simple, or perhaps simplistic. How about this?
Insert into EMPLOYEE select ID, FirstName, LastName, St_Date from OldEmployees;
Note that in the preceding query, the insert column list is implied; the select columns map to the target purely by position. Some of you will recognize the very significant danger here. If we later add a column into the middle of the pack, it will offset the remaining columns and misalign them with the target columns. Some say that at least one of the target columns will surely reject the mismatch… but will it? I’ve seen misaligned queries run for days or months before anyone caught them: the offset columns just happened to land on compatible target data types even though the content was all wrong. We must bring this issue under control.
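To make the failure mode concrete, here is a hypothetical variant of the query above in which two character columns have been transposed. Because both land on character targets, the statement runs cleanly while loading garbage:

Insert into EMPLOYEE select ID, LastName, FirstName, St_Date from OldEmployees;
-- Runs without error: EMP_FNAME quietly receives last names, EMP_LNAME receives
-- first names, and nothing in the load itself flags the problem.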
Now what if this were a fact table with 150 columns? The insert/select clauses immediately become unwieldy and unmanageable. What if we need to add or remove columns, or update the query to account for a data model change? These conditions functionally freeze the queries and make the solution brittle: small changes will shatter its operation.
Our best approach is to line up the target columns with the source rules (the elements of the SQL select phrase that will fill the target columns):1
EMP_ID              ID
EMP_FNAME           FirstName
EMP_LNAME           LastName
EMP_START_DATE      St_Date
We can extend a template like this to the limits of the database machine. The target columns will always line up, and the source/select columns will never offset. Does it sound like we’re signing up for a bit more than just sticking with the ETL tool? Don’t be fooled. The object is to simplify, not complicate, the logistics of generating a SQL statement. This structure lets us maintain the query without a painful impact review: it’s easy enough to scan with simple tools and compare against the catalog contents to find, apply, or simply report changes. Keep in mind that whenever a data model changes, practically all of the columns are pass-through,1 so we concentrate on the exceptions, potentially automating over 90 percent of the data model’s reconciliation. Turnaround for a new data model then becomes agile, perhaps very agile.
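As one sketch of that catalog comparison (this assumes Netezza’s catalog view _v_relation_column and its attnum/attname columns; the table name is just the running example), a simple script can pull the target’s column list in order and diff it against the template:

select attnum, attname
from _v_relation_column
where name = 'EMPLOYEE'
order by attnum;
-- Any column in the catalog but not in the template (or vice versa, or out of
-- order) is an exception to review; everything else is a pass-through the
-- generator can reconcile automatically.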
With this kind of power over query generation, we now have the freedom to manufacture lots of queries without losing logistical control. Can we scale the approach? Yes, and it’s already being done. A more advanced form of this template drives the framework engine,2 and it’s not uncommon to see dozens of such queries execute in a controlled order toward a functional goal, which keeps the flow simple to maintain and reuse.
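As a hypothetical sketch of such a controlled order (the intermediate table INT_EMPLOYEE_CLEAN is invented for illustration), each step is just another insert/select, executed in sequence:

-- Step 1: cleanse the raw feed into an intermediate table.
insert into INT_EMPLOYEE_CLEAN (EMP_ID, EMP_FNAME, EMP_LNAME, EMP_START_DATE)
select ID, trim(FirstName), trim(LastName), St_Date
from OldEmployees
where ID is not null;

-- Step 2: load the persistent target from the intermediate table.
insert into EMPLOYEE (EMP_ID, EMP_FNAME, EMP_LNAME, EMP_START_DATE)
select EMP_ID, EMP_FNAME, EMP_LNAME, EMP_START_DATE
from INT_EMPLOYEE_CLEAN;

Each statement is small enough to generate from the template and simple enough to test on its own.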
A primary logistical difference between MIMD and SIMD: with MIMD we’re forced into using fewer, more complex SQL statements, which are necessarily serialized because they saturate the hardware. SIMD encourages us to use more and simpler SQL statements and run them in parallel because the machine dispatches them so quickly. So SIMD finishes faster, by orders of magnitude, than its MIMD counterpart.
General-purpose MIMD platforms require initial (and perpetual) complexity, but purpose-built SIMD platforms allow us to begin simply and to stay simple forever. And simple means things are easier to maintain, extend, and troubleshoot—creating the foundation for agility throughout the life of the system.

References

1 Compleat Netezza, Copyright © 2012 VMII, ISBN 978-1-4610-9574-3. Excerpted by permission.
2 Netezza Data Integration Framework, Copyright © 2007-2012 Brightlight Consulting. All rights reserved.
