Objective of this tutorial are:
- Understand data marts and its types
- Learn the methods to build data warehouse
- Distinguish between types of dimensions
- Explain difference between types of facts
- Understand Hierarchy and Data warehouse Architecture
- Explain what are dimension and facts
- Learn importance of Dimension and fact table.
Data Marts
- Data marts is a smallest version of Data warehouse.
- Data marts deal with a single project.
- Data marts are focussed on one area. Hence they draw data from a limited number of sources.
- Time taken to build the data marts is very low compared to the time taken to build a DataWarehouse.
Difference between Data Warehouse and Data Marts:
Data warehouse Data Marts
Enterprise wide data Department wide data
Multiple subject areas Single Subject Area
Multiple Data sources Limited data source
Occupies large memory Occupies limited memory
Longer time to implement Shorter time to implement
Different types of Data Marts:
Dependent Data Mart:
Loaded from Data warehouse. First you form a Data Warehouse then you form a Data Marts.
"The data is first extracted from the OLTP systems and then populated in the central Data warehouse. From the Data warehouse the data then travel down to the data marts."
Q: Why someone will have a dependent Data Mart:
Ans: You will go for Data mart so that you do not have to access the whole Data warehouse for fetching some reports and you want quick turn around time from your source queries.
Independent Data Mart
Independent data mart is the one that depend on the OLTP source that is available. So the data is loaded directly from the transaction system into the data mart based on the requirement of the reports.
That's is suitable for small organization or small groups within an organization, which can be easily developed. So Independent Data mart is quick solution for quick requirements for small organization.
Hybrid Data Mart:
It is a mix of both Independent and Dependent Data Mart.
A hybrid data mart allows you to combine input from sources other than a data warehouse. This could be useful for many situations, especially when you need ad hoc integration, such as after a new group or product is added to the organization.
Most of the business scenarios will be Hybrid. In this scenario we have to get the file data and combine the report from the file and combine the report from the data warehouse and form a consolidated report.
What to Build First
Whether to build the Data warehouse first or the data mart first. Based on this question two approaches have been defined:
Top Down Approach
- Data warehouse is build first and then the data marts are built. So Data warehouse is built on top of OLTP system.
- It is one big coherent warehouse that is built directly from the OTLP source systems. So it has consolidated data from all the OLTP sources.
- The main advantage is that the information is available at a central location. various people accessing the data will have same information at a point of time.
- This disadvantage is that this becomes a very big project which should be handled by specialists else it can go any wrong.
- The cost and time involved is big and the result are not quick.
The approach is also called Inmon approach.
Bottom Up Approach
- Data marts required by different department or specific report requirements by specific people are built first then Data warehouse is built.
- Data marts can be built very quickly, hence the result are seen much faster.
- It is easier than Top Down Approach.
- Since the entire architecture is broken down in the initial phase itself, it become easier to manage. Operations in one datamart doesn't affect the operations of other data marts.
- Usually start up go for such approach where there are limited departments and it is very easy to build data marts and then go for the data warehouse. It is cost effective, easier to implement and you do not need heavy duty resources to implement this data marts.
Top Down Approach Definition:
In this approach you first build a data warehouse and then go down to different department. The data marts are then created from the data warehouse.
Advantages of top-down design are:
Provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse.
This approach is robust against business changes. Creating a new data mart from the data warehouse is very easy.
Disadvantages of top-down design are:
This methodology is inflexible to changing departmental needs during implementation phase.
It represents a very large project and the cost of implementing the project is significant.
Bottom Up Approach Definition:
In the bottom-up design approach, the data marts are created first to provide reporting capability. A data mart addresses a single business area such as sales, Finance etc. These data marts are then integrated to build a complete data warehouse. The integration of data marts is implemented using data warehouse bus architecture. In the bus architecture, a dimension is shared between facts in two or more data marts. These dimensions are called conformed dimensions. These conformed dimensions are integrated from data marts and then data warehouse is built.
Advantages of bottom-up design are:
This model contains consistent data marts and these data marts can be delivered quickly.
As the data marts are created first, reports can be generated quickly.
The data warehouse can be extended easily to accommodate new business units. It is just creating new data marts and then integrating with other data marts.
Disadvantages of bottom-up design are:
The positions of the data warehouse and the data marts are reversed in the bottom-up approach design.
Operational Data Store
- Operation Data Store(ODS) contains operational data with a very short window. Basically it is a kind of data warehouse but it just store the current data.
- The ODS is refreshed frequently so that it contains very current data. It can be updated daily, hourly etc. However Data warehouse is time variant, keep historical data and not updated daily rather updated monthly. Its like a central data that call centre people may need to look before they answer customer query. Call center might need certain summarize data for certain level of statistical view of the OLTP system. That's why we need ODS.
- The frequency of refresh depends on how current the data must be for reporting purpose.
- This concept comes in between the concept of OLTP and the Datawarehouse. When real time reporting is not possible in a Datawarehouse, ODS is used.
- It provides improved access to the critical and current operational data.
Example:
When we place an order on an e-commerce site and we call up the call centre to enquire about the state.
We cannot get information from Data warehouse, as the data might not even have reached the warehouse OLTP system is not ideal to fetch the data as already described
Hence in such case ODS is used.
When the status changes, the OLTP gets updated and the same is refreshed in ODS but only the current data is maintained.
Dimension Table
- The Objects of the subject are called Dimensions. The categories under which an information can be split across is a Dimension. All the related information for that category is placed inside the dimension.
- The tables that describe the dimensions involved are called "Dimension Table". For example a Hospital related Dimension table will have all information related to Hospital like Hospital Name, Hospital ID, Address, Number of doctor. All the dimension related to hospital will be present in that dimension table. So this information (detail of hospital) shouldn't be kept or repeated in any other table.
- Basically a Dimension table is a category of information.
- Dividing a Data warehouse project into dimensions, provides structured information for reporting purpose.
It is to be noted that business users or the end users who generate reports, fire queries on these dimension tables, because they contain descriptive information.
What are Facts?
- A fact is a measure that can be summed, averaged or manipulated. If a fact is manipulated, it has to be a measure that make a business sense.
- A dimension is lined to a fact. There is a fact table and multiple dimensions will be linked to it.
- A fact table contain 2 kind of data - a dimension key and a measure. For example, a fact with column ProductID, CustomerID and Quantity. Here productID and CustomerID will be Dimension Key and Quantity will be store the actual measure.
Conformed Dimension, De-Generated Dimension, Junk Dimension, Slowly Changing Dimension are other Dimensions.
Slowly Changing Dimension:
In a data warehousing concept when we split the data into various category, ( we split into categories so that we maintain category of data in separate tables which is more or less static). Like a customer name will change but it will not change frequently. So these are dimensions which might change but do not change.
- Dimension attributes that change slowly over a period of time rather than changing regularly is grouped as SCD.
Q: Why we cannot keep fact and dimension in same table.
Ans: If dimension change then we need to change the whole fact table, update everything inside the fact table. e.g if customer name changes and you have change all the record of fact table where that information has been kept.
However if maintain Dimension and fact separately then you will have a customerID and changes to customer name will be in Dimension table only. This way we are able to save ourself from updating Customer information in Fact table.
- Dimensions might not change frequently but it could change occasionally. That's why it is called Slowly changing Dimension.
- Let's consider an example of a person changing his/her city from city 1 to city 2 and this change does not happen at regular intervals. Here person information will be in different table and his location information will be in a different Location Dimension table. So only the location Dimension will change in fact table, person information will not change.
Ans. By Data warehousing concept there are three ways in which these types of attributes are handled (Handling means the way you want to keep the new information with keeping historic information also or you want to keep only the latest information):
1. Type 1 Slowly Changing Dimension: You overwrite the old Values.
You just keep only City 2 information. No information of City 1 neither do you keep when it is changed. So you never know when person was on City 1.
2. Type 2 Slowly Changing Dimension: You add a new record in your dimension table. You do not change the key, but you add a new record in your dimension table. So you will always know where he was at a certain point of time and when he actually change.
3. Type 3 Slowly Changing Dimension: You add a new column in your dimension table. Column as in previous location, current location. So either you will keep two column only or keep adding new column.
SCD Type 1 - Overwrite the Old Value
- The advantage of this approach is that it is very easy to follow and results in huge space savings and hence cost saving.
- The disadvantage is no historical data is kept.
- It is relatively difficult to determine which record is the most recent.
- A subquery has to be written to fetch the most recent record for each ID.
- To overcome these, there are 2 approaches to indicate the most current record.
- Suppose that James started living in New York from 1st Jan, 2000, hence for the very first time the record will look as:
1 01-01-2000 31-12-9999 James New York
- If James moves to Chicago on 1st Jan, 2014, then the record as per the 1st approach will look as below:
1 01-01-2000 31-12-2003 James New York
1 01-01-2004 31-12-9999 James Chicago
Hence the most recent record can be identified using the condition End Date > Current Date.
Approach 2 - Use an Indicator.
First time when a record arrives with New York as the City, the Indicator will be set as 'Y'
ID Start Date Names City Current Indicator
1 2000 James New York Y
When the person changes the city then the Indicator of the previous record is changed to 'N' and the indicator of the current record is changes to 'Y'.
ID Start Date Names City Current Indicator
1 2000 James New York N
1 2004 James New York Y
Hence to get the most recent record, just use the condition WHERE Current_Indicator = 'Y'
Advantage: You have historical data.
Disadvantage: More space needed.
SCD Type 3 - Add a New Column
Sometimes it is not required to maintain the entire history of a slowly changing dimension. We might want to retain the current record and the immediate previous record.
In such cases instead of adding rows, it is beneficial to add columns as illustrated below:
For the first time when James moves to New York the old city column will be blank
ID Start Date Names City Old City Old Year
1 2000 James New York - -
When he moves to Chicago, the city column becomes chicago and the old city column will become New York
ID Start Date Names City Old City Old Year
1 2004 James Chicago New York 2000
The disadvantage of the above approach is clearly seen in the table above. For each attribute, there must be 2 columns, one indicating the current value and other indicating the previous value.
Additive Fact table:
If the rows can be combined to ge the final output then it is known as Additive Fact Table. For example:
No. of product sold on Day 1 = 500
No. of product sold on Day 2 = 250
Total No. of product sold on two Days = 750
Semi-Additive Fact Table
However same can not be applied in following example:
Balance of Company's account 1 for day 1 = 5000
Balance of Company's account 2 for day 1 = 3000
Two balance in two accounts of a company on Day 1 = 8000 it will be right.
Now for the data:
Balance of Company's account for day 1 = 5000
Balance of Company's account for day 2 = 3000
If a perform additive approach: Total balance in Acc in two days: 8000, Which will be wrong. This kind of table is called Semi-Additive fact table where we cannot perform addition across all the columns in fact tables.
Non-Additive Fact Table
Similarly if there is a fact table called profit margin:
Profit Margin for Day 1 = 30%
Profit Margin for Day 2 = 70%
Total profit margin for two days = 100%. This will be wrong.
Those fact tables where you cannot perform additive operation across any column then that fact table is called Non-Additive Fact table.
No Fact table
A Fact table where only dimension id's are present and no measurable attribute then this table is called No Fact Table because there must be some measurable attribute. Also called Fact less fact table.
Types of Fact
There are three types of Fact Table:Additive Fact table:
If the rows can be combined to ge the final output then it is known as Additive Fact Table. For example:
No. of product sold on Day 1 = 500
No. of product sold on Day 2 = 250
Total No. of product sold on two Days = 750
Semi-Additive Fact Table
However same can not be applied in following example:
Balance of Company's account 1 for day 1 = 5000
Balance of Company's account 2 for day 1 = 3000
Two balance in two accounts of a company on Day 1 = 8000 it will be right.
Now for the data:
Balance of Company's account for day 1 = 5000
Balance of Company's account for day 2 = 3000
If a perform additive approach: Total balance in Acc in two days: 8000, Which will be wrong. This kind of table is called Semi-Additive fact table where we cannot perform addition across all the columns in fact tables.
Non-Additive Fact Table
Similarly if there is a fact table called profit margin:
Profit Margin for Day 1 = 30%
Profit Margin for Day 2 = 70%
Total profit margin for two days = 100%. This will be wrong.
Those fact tables where you cannot perform additive operation across any column then that fact table is called Non-Additive Fact table.
No Fact table
A Fact table where only dimension id's are present and no measurable attribute then this table is called No Fact Table because there must be some measurable attribute. Also called Fact less fact table.
2 comments:
You put really very helpful information.DATA WAREHOUSE
The demand for the real transformation of the big data into the valuable form has made my mind to choose the services provided by your blog. The solutions and the data transforming techniques suggested by your services helped me to store the large quantity of data comfortably. If you guys also want to know some extra services about this blog and then you can click here for getting unique features.
Post a Comment