Business Intelligence and Web 2.0: sql

One of the impediments to providing Business Intelligence to management is the technical expertise required to make a request to the data warehouse. A busy executive cannot be expected to be able to use a database language like SQL and must therefore depend upon standard rigid reports from OLAP reporting tools. In this post, I will demonstrate how easy it is to provide a technically untrained executive with a graphical tool that will allow him to generate millions of combinations of queries to standard data warehouse.

The challenge appears difficult because, not only is SQL a technically complex language based upon mathematical set theory, but the SQL queries that go to a database designed as a data warehouse are particularly complex to code. However, within these challenges lies the very solution to providing real data warehouse power and flexibility to the non-technical user.

Every request for data from an orthodox data warehouse is complex but exactly the same in form—only the parameters change in a statement that must have the same components in the same places to utilize the analytical power of the data warehouse. The vehicle must be the same while the options differ depending upon the specific questions asked by the analyst. Let’s dissect the standard data warehouse query and factor out the generalities from the particulars of a specific query and thereby determine what are the parameters are to a “Universal OLAP Query.”

First of all, data warehouses have a particular architecture to their data that remains the same whether we are using a specialized “multidimensional database” or a standard relational database designed to implement a multidimensional “star schema.” This architecture provides the analyst with a set of facts that can be grouped and summed according to the dimensions of the facts. For example, where a company keeps a data warehouse of retail sales, the design of the data warehouse would appear as follows where the blocks are tables (a single “fact” table that is linked to four “dimension tables” that contain the sets of objects that the retail sales are related to):

Although the dimensions are specific to this data warehouse, the structure of a fact table attached to a set of dimensions remains the same in any orthodox data warehouse, accounting for its description as a “multidimensional database” or a “star schema.” Much of the complexity of an OLAP data query is a result of the joins that must occur between the Fact table and one or more of the dimensional tables. The solution to creating a universal OLAP query is automate this join process and provide the non-technical analyst with a user interface that allows him to simply select the facts and dimension that those facts will be rolled up by. Know this, we can see that the star schema query will always appear in the following form:

select Fact, Dimension Criteria
from Fact Table, Dimensional Tables
where Join Criteria, Slicing Criteria
group by Dimensional Criteria

This structure remains the same for each OLAP query, the things that change from query to query are the underlined terms that we can turn into parameters to our universal OLAP query generator. Reviewing these terms in the above example gives us the following:

This reduces the syntax of SQL (or some other technical language) to a set of parameters that can be chosen from a graphical user interface that uses drag-and-drop or some other selection process for our non-technical user, however, there are a lot of parameters here and we can reduce them somewhat. The identity of the Fact Table can be programmatically inferred from the selection of Facts, eliminating it as a parameter. Likewise, the Dimension Tables and the Join Criteria can be inferred from Dimensional Criteria and the Slicing Criteria (We assume that the Universal Query Generator has been initialized with the metadata that describes the data warehouse). This reduces the input to the generator as follows:

A user interface that has been configured with the description of the data warehouse can then provide a simple wizard that allows the user to select the facts that he wants rolled up (summed, averaged, etc.), the attributes of the dimensions by which the data will be rolled up (for example, customer age group, product color, time periods, etc.), and the limits of the data (slicing criteria). The Facts and Dimensional Criteria are simple selection operations, while the slicing criteria can involve simple to more involved logic as data is sliced with “and” and “or” conditions. However, the whole thing can be done with an intuitive user interface that requires no technical language skills.

There is no problem in all mathematics that cannot be solved by direct counting.
Ernst Mach

In the two previous posts, virtual data was described as data that was produced by an algorithm from already stored data. It is produced upon demand by the computer and its programs as opposed to being stored and maintained within the computer’s database.

Virtual data is typically the output or finished product of an algorithm that runs within an application in the computer. The computer executes the algorithm upon the user’s request, and real stored data is read and converted by the algorithm into new “virtual” data that is not directly kept in memory.

This application of an algorithm is always deterministic and therefore one algorithm along with one set of input from stored data should provide only one set of unique information. To produce another piece of virtual data, either the algorithm or the input data (or the combination of both) must change. Some variations upon this model can be realized by parameterizing the algorithm to allow it to operate with user’s options – essentially, to make it a slightly different algorithm.

The sums that are produced from a data warehouse are a particularly powerful illustration of producing virtual data from a parameterized algorithm working upon stored data. The algorithm can be considered a particular type of database query, perhaps written in SQL, and the parameters are the dimensions by which the algorithm summarizes the data. This is the most powerful form of virtual data because the dimensions can be combined in thousands upon thousands of different combinations, producing an endless variety of virtual data. The SQL query in a data warehouse can be parameterized into being essentially millions of different algorithms.

Most of the numbers that run our businesses are virtual data that can be produced by combining and summarizing the company’s simple financial transactions. The data warehouse is the factory that can produce this virtual data, and millions of variations on it, through its powerful ability to be parameterized by dimensions.

Business Intelligence and Web 2.0

Friday, November 7, 2008

The Universal OLAP Query

Friday, July 11, 2008

6. Virtual Data and Algorithms

Blog Archive

The Tao of Financial Information

Technorati

Related Links

About Me