EnterpriseDB’s Postgres Plus solutions let you confidently develop and deploy PostgreSQL-backed applications that scale all the way from embedded solutions to massive OLTP and data warehouse systems that serve thousands of users.
DW & BI
Data warehousing is the science of storing and retrieving data so that meaningful analysis can be performed on it to support a business decision or to predict a business outcome.
Business Intelligence, on the other hand, is simply the art and science of presenting historical data in a meaningful way (often by using different data visualization techniques). Raw data stored in databases turns into valuable information through the implementation of Business Intelligence processes.
Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining and predictive analytics.
A data warehouse is a subject-oriented, non-volatile, integrated, time-variant collection of data in support of management's decisions. Put another way, it is an electronically stored collection of integrated data that can be used for intelligent analysis. Data warehousing arises from an organisation's need for reliable, consolidated, unique and integrated reporting and analysis of its data at different levels of aggregation.
DW technologies provide historical, current and predictive views of business operations by analysing present and historical business data. Data analysis is often done using visualization techniques that turn complex data into images that tell a compelling story. Analysed in this way, raw data helps management take the right decisions.
Database Types Based on Use Case
Any database application system can be categorized under two types based on the use case. Online Transaction Processing (OLTP) is the transaction system that collects business data, whereas Online Analytical Processing (OLAP) is the reporting and analysis system on that data.
In short, transactional or operational OLTP systems follow ER modelling, where the design is driven mainly by optimized INSERT, UPDATE and DELETE operations and by fast SELECTs over relatively few records/tuples. Efficiency is measured by the number of transactions that can be handled per second. Tables are typically highly normalized, to third normal form (3NF) or higher (such as BCNF), so an OLTP system usually has a large number of tables. The data is basically a snapshot of ongoing business processes.
Analytical OLAP systems follow dimensional modelling, where the design is driven mainly by optimized bulk SELECT and aggregation. The need is fast retrieval of aggregated results over groups of rows or huge datasets. Tables are typically highly denormalized: related master data is usually put together in a single table for fast retrieval and to avoid multiple joins when fetching related information. The data consists of aggregated, historical, multi-dimensional views of various kinds of business activities. Two design methodologies are followed for analytics database design: the star schema and the snowflake schema.
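The contrast between the two access patterns can be sketched as follows. This is a minimal illustration, assuming Python's built-in sqlite3 module as a stand-in for PostgreSQL; the `orders` table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
    [(1, "East", 100.0), (2, "East", 50.0), (3, "West", 75.0)],
)

# OLTP-style access: modify and read back a single row by its key.
conn.execute("UPDATE orders SET amount = 120.0 WHERE order_id = 1")
oltp_row = conn.execute(
    "SELECT order_id, region, amount FROM orders WHERE order_id = 1"
).fetchone()

# OLAP-style access: scan many rows and aggregate them per group.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The first query touches one row through its primary key, while the second reads the whole table to produce one aggregated row per region.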
Data Modelling Paradigms
Data Modelling depends specifically on the application and the use case. On a high level, there are two data modelling paradigms - ER modelling & Dimensional modelling.
An Entity-Relationship (ER) model is a systematic way of describing and defining a business process. The process is modelled as components (entities) linked by relationships that express the dependencies and requirements between them. In ER modelling the goal is to normalize the data by reducing redundancy. This differs from dimensional modelling, where the main goal is to improve data retrieval.
Dimensional modelling (DM) is the name of a set of techniques and concepts used in data warehouse design. According to data warehousing consultant Ralph Kimball, DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance. Dimensional modelling always uses the concepts of facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts. A dimensional model consists of dimension tables and fact tables. Fact tables store transactional measurements together with the foreign keys to the dimension tables that qualify the data. The goal of a dimensional model is not to achieve a high degree of normalization but to make data retrieval easier and faster.
In a star schema, multiple dimension tables surround a central fact table in the form of a star. The star schema provides a denormalized design. The process of normalizing the dimension tables into multiple related tables is called snowflaking, so the most important difference is that the dimension tables in a snowflake schema are normalized. Snowflaking increases the degree of normalization in the design.
Compared to a star schema, a snowflake schema query is more complex because the dimension tables are normalized: every new level inside the same dimension adds another JOIN. Although there is no hard and fast rule, a star schema is typically used in a data mart and a snowflake schema in a data warehouse implementation.
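To make the extra JOIN concrete, here is a small sketch of the same product dimension modelled both ways. It assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL, and all table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the category name is stored directly in the product dimension.
CREATE TABLE dim_product_star (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: the category is normalized out into its own table.
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_key INTEGER PRIMARY KEY, product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key));

CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

INSERT INTO dim_product_star VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_category VALUES (10, 'Electronics'), (20, 'Furniture');
INSERT INTO dim_product_snow VALUES (1, 'Laptop', 10), (2, 'Desk', 20);
INSERT INTO fact_sales VALUES (1, 900.0), (1, 1100.0), (2, 300.0);
""")

# Star schema: one join from the fact table to the dimension.
star_totals = conn.execute("""
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_key = f.product_key
    GROUP BY p.category_name ORDER BY p.category_name
""").fetchall()

# Snowflake schema: one more join for the normalized category level.
snow_totals = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()

print(star_totals)
print(snow_totals)
```

Both queries return the same totals per category; the snowflake version simply pays for the normalization with an additional join.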
Dimension (Master Data)
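Dimension tables are commonly classified by how they handle changes to master data over time, known as slowly changing dimension (SCD) types. In SCD Type 1, history is not maintained and the table always shows the most recent data. SCD Type 1 dimensions are loaded by merge (update-or-insert) operations.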
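A Type 1 load can be sketched as an upsert that overwrites the existing row. This is an illustration only, assuming Python's built-in sqlite3 module (SQLite 3.24 or later) as a stand-in for PostgreSQL, whose `INSERT ... ON CONFLICT ... DO UPDATE` syntax is the same for this case. The `dim_customer` table and the `load_type1` helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT)")


def load_type1(conn, customer_id, city):
    """Insert the customer, or overwrite the city if the customer already exists."""
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city) VALUES (?, ?) "
        "ON CONFLICT (customer_id) DO UPDATE SET city = excluded.city",
        (customer_id, city),
    )


load_type1(conn, 1, "Pune")
load_type1(conn, 2, "Delhi")
load_type1(conn, 1, "Mumbai")  # overwrites 'Pune'; the old value is lost

type1_rows = conn.execute(
    "SELECT customer_id, city FROM dim_customer ORDER BY customer_id"
).fetchall()
print(type1_rows)
```

After the third load the table holds only the latest city for customer 1, which is exactly the "no history" behaviour of Type 1.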
SCD Type 2 is used for full history preservation of master data using additional metadata columns Start_Date, End_Date, Latest_Flag/Version. A type 2 dimension table tracks the historical changes by creating separate rows in the table with different surrogate keys.
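The row-versioning idea behind Type 2 can be sketched as "close the current row, then insert a new one". This is a minimal illustration, assuming Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table layout and the `apply_type2_change` helper are hypothetical, with column names borrowed from the metadata columns mentioned above.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key, assigned automatically
        customer_id  INTEGER,              -- business (natural) key
        city         TEXT,
        start_date   TEXT,
        end_date     TEXT,
        latest_flag  INTEGER
    )
""")


def apply_type2_change(conn, customer_id, new_city, change_date):
    """Close the current version (if any) and add a new row with a new surrogate key."""
    conn.execute(
        "UPDATE dim_customer SET end_date = ?, latest_flag = 0 "
        "WHERE customer_id = ? AND latest_flag = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, start_date, end_date, latest_flag) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )


apply_type2_change(conn, 101, "Pune", "2020-01-01")
apply_type2_change(conn, 101, "Mumbai", "2023-06-01")

type2_history = conn.execute(
    "SELECT customer_key, city, start_date, end_date, latest_flag "
    "FROM dim_customer ORDER BY customer_key"
).fetchall()
print(type2_history)
```

Each change adds a row, so the table grows vertically and the full history of customer 101 remains queryable.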
SCD Type 3 is used for partial history preservation, mostly by using an additional metadata column such as Previous_Value. A Type 3 dimension stores the history in a separate column instead of separate rows. So, unlike a Type 2 dimension, which grows vertically, a Type 3 dimension grows horizontally.
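A Type 3 change can be sketched as a single UPDATE that shifts the current value into the "previous" column. This is an illustration only, assuming Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table and the `apply_type3_change` helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    "customer_id INTEGER PRIMARY KEY, city TEXT, previous_city TEXT)"
)
conn.execute("INSERT INTO dim_customer VALUES (101, 'Pune', NULL)")


def apply_type3_change(conn, customer_id, new_city):
    """Move the current city into previous_city, then store the new city.

    The right-hand sides of SET are evaluated against the row's old values,
    so previous_city receives the city as it was before this update.
    """
    conn.execute(
        "UPDATE dim_customer SET previous_city = city, city = ? WHERE customer_id = ?",
        (new_city, customer_id),
    )


apply_type3_change(conn, 101, "Mumbai")
apply_type3_change(conn, 101, "Delhi")

type3_row = conn.execute(
    "SELECT customer_id, city, previous_city FROM dim_customer"
).fetchone()
print(type3_row)
```

After two changes only the latest and the immediately preceding city survive; the original value 'Pune' is gone, which is why Type 3 preserves history only partially.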
A conformed dimension is a dimension that is shared across multiple subject areas. Consider the 'Customer' dimension: both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These are conformed dimensions.
Dimensions are often reused across multiple subject areas with a different contextual meaning in each. For instance, a "Date" dimension can be used for "Date of Sale" as well as "Date of Delivery" or "Inventory Date". A Time or Date dimension therefore often acts as a role-playing dimension that qualifies the dates in various fact tables such as Sales Order, Inventory and Invoice.
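In SQL, a role-playing dimension usually shows up as the same dimension table joined more than once under different aliases. The sketch below assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL; the tables, keys and aliases are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE fact_sales_order (
    order_id INTEGER PRIMARY KEY,
    sale_date_key INTEGER REFERENCES dim_date(date_key),
    delivery_date_key INTEGER REFERENCES dim_date(date_key));

INSERT INTO dim_date VALUES (20240105, '2024-01-05'), (20240109, '2024-01-09');
INSERT INTO fact_sales_order VALUES (1, 20240105, 20240109);
""")

# The single dim_date table plays two roles, one per alias.
role_rows = conn.execute("""
    SELECT f.order_id,
           sale_date.calendar_date     AS date_of_sale,
           delivery_date.calendar_date AS date_of_delivery
    FROM fact_sales_order f
    JOIN dim_date AS sale_date     ON sale_date.date_key = f.sale_date_key
    JOIN dim_date AS delivery_date ON delivery_date.date_key = f.delivery_date_key
""").fetchall()
print(role_rows)
```

Only one physical date table is maintained; each alias gives it a different business meaning in the query.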
Fact (Transactional Data)
A fact table stores measurements. Usually these measurements are captured against a specific time and vary over time. The business may not always be able to capture every measure for every point in time or for every specified time interval (week, month, quarter, year, etc.).
Those unavailable measurements can either be left empty (NULL) or be filled with the last available measurement. The first case is an example of an incident fact and the second an example of a snapshot fact.
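The two ways of handling missing measurements can be sketched in plain Python. The month labels and values below are made-up sample data used purely for illustration.

```python
# Hypothetical sample data: measurements exist only for some months.
months = ["2024-01", "2024-02", "2024-03", "2024-04"]
observed = {"2024-01": 100, "2024-03": 130}

# Incident-style fact: periods without a measurement stay empty (None / NULL).
incident_fact = [(month, observed.get(month)) for month in months]

# Snapshot-style fact: periods without a measurement repeat the last known value.
snapshot_fact = []
last_value = None
for month in months:
    if month in observed:
        last_value = observed[month]
    snapshot_fact.append((month, last_value))

print(incident_fact)
print(snapshot_fact)
```

The incident version keeps gaps visible, while the snapshot version carries the last available value forward so that every period has a figure.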
Data model design plays the most vital role in the success of a data warehouse or data mart implementation project. A sound dimensional data model depends on analysis of the source systems and of the target analytic requirements. Second, the relational database management system (RDBMS) behind the decision support system plays a key role with regard to high availability, high performance, low latency and concurrent usage. The RDBMS should be configured and tuned to support batch analytic queries.
Below are some key features which are relevant for EDB as an analytical database.
- Fully ACID-compliant, ANSI SQL:2008 Relational Database Management System
- UPSERT (Update or Insert)
- Surrogate Keys using sequences (SERIAL or identity columns, the PostgreSQL counterpart of AUTO_INCREMENT)
- Join Optimization
- Horizontal Table Partitioning – Hash, Range, List
- B-Tree Index
- Filter based B-Tree Indexes
- Expression based B-Tree Indexes
- Block Range Indexes (BRIN)
- Parallel Query
- Query Optimization Hints
- Logical data extents using Tablespace
- Window Functions
- SQL Aggregates with CUBE, ROLLUP and GROUPING SETS
- Recursive Queries
- Materialized Views with Online Refresh
- Stored Procedures
- Workload Management – allows you to run OLTP and OLAP on same server with resource limits at user or session level
- Session level resource limits e.g. for sorting memory
- Support for cstore – an extension to enable columnar store for analytical workloads
- Supports a vast range of data types and data models – Relational, Numbers, Strings, Geo Spatial, Range, JSON, key-value pair and much more
- Foreign Data Wrappers and DB Link allow federated data access from remote and external sources e.g. other PostgreSQL databases, Oracle, MySQL, NoSQL, flat files, HDFS etc.
- JSON and XML capabilities allow easier parsing of data from external sources
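As one illustration of the analytical SQL features listed above, the following sketch computes a running total with a window function. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for PostgreSQL, which accepts the same `SUM(...) OVER (ORDER BY ...)` syntax; the `monthly_sales` table is hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (sale_month TEXT PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO monthly_sales (sale_month, amount) VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 150.0), ("2024-03", 50.0)],
)

# Window function: each row keeps its own amount and also gets the
# cumulative sum of all rows up to and including its month.
running_rows = conn.execute(
    "SELECT sale_month, amount, "
    "SUM(amount) OVER (ORDER BY sale_month) AS running_total "
    "FROM monthly_sales ORDER BY sale_month"
).fetchall()
print(running_rows)
```

Unlike a GROUP BY aggregate, the window function does not collapse rows, which makes it convenient for running totals, rankings and period-over-period comparisons in reports.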
EDB Setup for Analytics
Below are some recommended tuning parameters and considerations for using EDB as an analytical database.
- Kernel & OS Tuning
- Latest version & service pack of EDB Postgres Advanced Server
- Multiple Storage Drives (Hard Disk Drive + Solid State Drive)
- High-throughput RAID
- Write Ahead Log (WAL) on a separate disk (10 to 50 GB space)
- DB Server Parameter Tuning
  - Limit the number of connections
    max_connections = 40
  - Raise memory parameter limits (note that work_mem applies per sort or hash operation, not per server)
    shared_buffers = ¼ of RAM
    work_mem = 1GB
    maintenance_work_mem = 1GB
    temp_buffers = 1GB
    effective_cache_size = ¾ of RAM
    wal_buffers = 16MB
  - Disable autovacuum
    autovacuum = off
    vacuum_cost_delay = 0
- Perform VACUUM and ANALYZE after data loading
- Maintain Table Partitioning for huge tables
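The RAM-relative settings above can be turned into concrete values with a few lines of arithmetic. The helper below is a hypothetical sketch, not an EDB tool: it applies the ¼ and ¾ rules of thumb from the list to a given RAM size and also gives a rough lower bound for memory that sorts could claim if every connection used one work_mem allocation at once.

```python
def suggest_memory_settings(ram_gb, max_connections=40, work_mem_gb=1):
    """Apply the rules of thumb from the list above to a server with ram_gb of RAM.

    Hypothetical helper for illustration; the ratios are the ones quoted in the text.
    """
    ram_mb = ram_gb * 1024
    return {
        "max_connections": str(max_connections),
        "shared_buffers": f"{ram_mb // 4}MB",            # 1/4 of RAM
        "effective_cache_size": f"{ram_mb * 3 // 4}MB",  # 3/4 of RAM
        "work_mem": f"{work_mem_gb}GB",
        "maintenance_work_mem": "1GB",
        "wal_buffers": "16MB",
    }


def rough_work_mem_demand_gb(max_connections=40, work_mem_gb=1):
    """One work_mem allocation per connection; a single query may use several."""
    return max_connections * work_mem_gb


settings = suggest_memory_settings(ram_gb=64)
conf_lines = [f"{name} = {value}" for name, value in settings.items()]
demand_gb = rough_work_mem_demand_gb()

print("\n".join(conf_lines))
print(f"work_mem demand with all connections sorting: at least {demand_gb} GB")
```

For a 64 GB server this yields 16384MB for shared_buffers and 49152MB for effective_cache_size. The work_mem figure is worth comparing against available RAM, since it is the reason the connection count is kept low in this setup.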
EnterpriseDB Postgres can be accessed by a wide variety of BI analytics tools, data mining tools and data integration tools using JDBC and ODBC connectivity.