DW Implementation Using EDB Postgres

Updated on Oct 02, 2020

PostgreSQL is an advanced open source relational database management system. EDB develops and integrates performance, security, and manageability enhancements into PostgreSQL to support enterprise-class workloads for its database, EDB Postgres Advanced Server. EDB has also developed database compatibility for Oracle to facilitate the migration of workloads from Oracle to EDB Postgres and to support the operation of many Oracle workloads on EDB Postgres.

EDB Postgres

EnterpriseDB’s Postgres Plus solutions let you confidently develop and deploy PostgreSQL-backed applications that scale all the way from embedded solutions to massive OLTP and data warehouse systems that serve thousands of users.

DW & BI

Data warehousing is the science of storing and retrieving data so that some meaningful analysis can be performed on that data to corroborate / support a business decision or to predict a business outcome.

Business Intelligence, on the other hand, is simply the art and science of presenting historical data in a meaningful way (often by using different data visualization techniques). Raw data stored in databases turns into valuable information through the implementation of Business Intelligence processes.

Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining and predictive analytics.

Data Warehouse

A data warehouse is a subject oriented, non-volatile, integrated, time variant collection of data in support of management's decisions. A data warehouse is an electronically stored collection of integrated data that can be used for the purpose of intelligent analysis. Data warehousing arises in an organisation's need for reliable, consolidated, unique and integrated reporting and analysis of its data, at different levels of aggregation.

DW technologies provide historical, current and predictive views of business operations by analysing the present and historical business data. Data analysis is often done using visualization techniques that turn complex data into images that tells compelling story. Raw data by this process of analysis help management take right decisions.

Database Types Based on Use Case

Any database application system can be categorized under two types based on the use case. Online Transaction Processing (OLTP) is the transaction system that collects business data, whereas Online Analytical Processing (OLAP) is the reporting and analysis system on that data.

In short, transactional or operational OLTP systems follow ER modelling, where the design is mainly driven for optimized INSERT, UPDATE & DELETE & faster SELECT for relatively few records/tuples. Efficiency is measured by number of transactions that can be handled per second. Typically tables are highly Normalized, 3rd Normal form or higher (BCNF). Usually there are lots of highly normalized tables in an OLTP system. Data is basically snapshot of ongoing business processes.

Analytical OLAP systems follow Dimensional modelling, where the design is mainly driven for optimized BULK SELECT & AGGREGATION. Fast data retrieval of aggregated results on groups/huge dataset is the need. Typically tables are highly De normalized. Related master data are usually put together in single table for fast retrieval and to avoid multiple joins to get related information. Data is basically aggregated & historical multi-dimensional views of various kinds of business activities. Two design methodologies are followed for analytics database design - either Star or Snowflake Schema.

Data Modelling Paradigms

Data Modelling depends specifically on the application and the use case. On a high level, there are two data modelling paradigms - ER modelling & Dimensional modelling.

An Entity-Relationship (ER) model is a systematic way of describing and defining a business process. The process is modelled as components (entities) that are linked with each other by relationships that express the dependencies and requirements between them. ER model or entity-relationship model is a particular methodology of data modelling wherein the goal of modelling is to normalize the data by reducing redundancy. This is different than dimensional modelling where the main goal is to improve the data retrieval mechanism.

Dimensional modelling (DM) is the name of a set of techniques and concepts used in data warehouse design. According to data warehousing consultant Ralph Kimball, DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance. Dimensional modelling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts. Dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from dimension tables that qualify the data. The goal of Dimensional model is not to achieve high degree of normalization but to facilitate easy and faster data retrieval.

Dimensional Modelling

Multiple dimension tables surrounding the central fact table in the form of a Star. Star schema provides a de-normalized design. Interestingly, the process of normalizing the dimension tables into multiple related tables is called snowflaking. The most important difference is that the dimension tables in the snowflake schema are normalized. Snow-flake increases degree of normalization in the design.

Compared to star schema the snowflake schema query is more complex, because the dimension tables are normalized. We have to add another JOIN for every new level inside the same dimension. Although there’s no hard and fast rule but typically we go with star based schema in data mart & snowflake based schema in data warehouse implementation.

Dimension (Master Data)

In SCD Type 1 history is not maintained and the table always shows the recent data. SCD Type 1 dimensions are loaded by Merge operations.

SCD Type 2 is used for full history preservation of master data using additional metadata columns Start_Date, End_Date, Latest_Flag/Version. A type 2 dimension table tracks the historical changes by creating separate rows in the table with different surrogate keys.

SCD Type 3 is used for partial history preservation, mostly by using an additional metadata column like Previous_Value. A type 3 dimension stored the history in a separate column instead of separate rows. So unlike a type 2 dimension which is vertically growing, a type 3 dimension is horizontally growing.

A conformed dimension is the dimension that is shared across multiple subject area. Consider 'Customer' dimension. Both marketing and sales department may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimension.

Dimensions are often reused across multiple subject area with different contextual meaning. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Inventory Date". A Time or Date dimension often acts as a role playing dimension to qualify the dates in various fact tables like Sales Order, Inventory and Invoice.

Fact (Transactional Data)

A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time and these measurements vary with respect to time. Now it might so happen that the business might not be able to capture all of its measures always for every point in time or over specified time intervals (week, month, quarter, year, etc.).

Then those unavailable measurements can be kept empty (Null) or can be filled up with the last available measurements. The first case is the example of incident fact and the second one is the example of snapshot fact.

Key Note

Data modelling design plays the foremost vital role in the success of a data warehouse/data mart implementation project. The key to perfect Dimensional Data model depends on the source system analysis & the target analytic requirements. Secondly the Relational Database Management System supporting the decision support system plays the key role with regards to high-availability, high-performance, low-latency & concurrent usage. The RDBMS should be configured and tuned to support batch analytics query.

EDB Features

Below are some key features which are relevant for EDB as an analytical database.

Full ACID-compliant, ANSI-SQL:2008 Relational Database Management System

UPSERT (Update or Insert)

Surrogate Keys using AUTO_INCREMENT

Join Optimization

Horizontal Table Partitioning – Hash, Range, List

B-Tree Index

Filter based B-Tree Indexes

Expression based B-Tree Indexes

Block Range Indexes (BRIN)

Parallel Query

Query Optimization Hints

Logical data extents using Tablespace

Window Functions- SQL Aggregates with CUBE, ROLLUP and GROUPING SETS

Recursive Queries

Materialized View- Online Refresh

Stored Procedures

Workload Management – allows you to run OLTP and OLAP on same server with resource limits at user or session level

Session level resource limits e.g. for sorting memory

Support for cstore – an extension to enable columnar store for analytical workloads

Supports vast range of data types and data model – Relational, Numbers, Strings, Geo Spatial, Range, JSON, key-value pair and much more

Foreign Data Wrappers and DB Link allow federated data access from remote and external sources e.g. other PostgreSQL databases, Oracle, MySQL, NoSQL, flat files, HDFS etc.

JSON and XML capabilities allow easier parsing of data from external sources

EDB Setup for Analytics

Below are some recommended tuning parameters and considerations for using EDB as an analytical database.

Kernel & OS Tuning

Latest version & service pack of EDB Postgres Advanced Server

Multiple Storage Drives (Hard Disk Drive + Solid State Drive)

High-throughput RAID

Write Ahead Log (WAL) on a separate disk (10 to 50 GB space)

DB Server Parameter Tuning

Limit Few Connections
- max_connections = 40

Optimize/Raise Memory Parameters Limits
shared_buffers = ¼ of RAM
work_mem = 1GB
maintenance_work_mem = 1GB
temp_buffers = 1GB
effective_cache_size = ¾ of RAM
wal_buffers = 16MB

No Auto Vacuum
autovacuum = off
vacuum_cost_delay = off

Perform Vacuum & Analyze after data loading

Maintain Table Partitioning for huge tables

EDB Connectivity

EnterpriseDB Postgres can be accessed by a wide variety of BI Analytics tools, Data mining tools & Data Integration tool using JDBC & ODBC connectivity.