Showing posts with label Data Management. Show all posts

Tuesday, 16 October 2012

Collaborative Data Management – Need of the hour!

The topic may seem like a pretty old concept, yet it is a vital one in the age of Big Data, Mobile BI and the Hadoops! As per the FIMA 2012 benchmark report, Data Quality (DQ) still remains the topmost priority in data management strategy.

‘What gets measured improves!’ But a Data Quality (DQ) initiative is often a reactive strategy rather than a proactive one; consider the impact bad data could have in a financial reporting scenario: a tarnished brand and loss of investor confidence.

But are business users aware of DQ issues? A research report by The Data Warehousing Institute suggested that more than 80% of the business managers surveyed believed that the business data was fine, but just half of their technical counterparts agreed! Having recognized this disparity, it is a good idea to match the dimensions of data quality to the business problems created by a lack of it.

Data Quality Dimensions – IT Perspective

 

  • Data Accuracy – the degree to which data reflects the real world
  • Data Completeness – inclusion of all relevant attributes of data
  • Data Consistency – uniformity of data across the enterprise
  • Data Timeliness – Is the data up-to-date?
  • Data Auditability – Is the data reliable?
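Some of these dimensions can be turned into simple numbers. The sketch below is only an illustration: the sample records, the field names and the 90-day freshness threshold are all assumptions made for the example, not figures from any report cited here.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "city": "London", "updated": date(2012, 10, 1)},
    {"id": 2, "email": None,            "city": "Paris",  "updated": date(2012, 3, 15)},
    {"id": 3, "email": "c@example.com", "city": None,     "updated": date(2011, 12, 1)},
    {"id": 4, "email": "a@example.com", "city": "London", "updated": date(2012, 9, 20)},
]


def completeness(rows, field):
    """Share of rows in which `field` is populated (completeness)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within `max_age_days` of `as_of` (timeliness)."""
    return sum((as_of - r[field]).days <= max_age_days for r in rows) / len(rows)


def duplicate_count(rows, field):
    """Number of populated values of `field` that repeat an earlier row."""
    seen, dupes = set(), 0
    for r in rows:
        value = r[field]
        if value is None:
            continue
        if value in seen:
            dupes += 1
        seen.add(value)
    return dupes


if __name__ == "__main__":
    print("email completeness:", completeness(records, "email"))
    print("fresh within 90 days:", timeliness(records, "updated", date(2012, 10, 16), 90))
    print("duplicate emails:", duplicate_count(records, "email"))
```

Which fields matter and what counts as "fresh" are exactly the kind of decisions that need the business team in the room.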

 

Business Problems – Due to Lack of Data Quality

| Department/End-Users | Business Challenges | Data Quality Dimension* |
| --- | --- | --- |
| Human Resources | Employee performance as reviewed by the manager is not in sync with the HR database; inaccurate employee classification based on government classification groups (minorities, differently abled) | Data consistency, accuracy |
| Marketing | Print and mailing costs associated with sending duplicate copies of promotional messages to the same customer/prospect, or sending them to the wrong address/email | Data timeliness |
| Customer Service | Extra call support minutes due to incomplete customer data and poorly defined metadata for the knowledge base | Data completeness |
| Sales | Lost sales due to a lack of proper customer purchase/contact information, which prevents the organization from performing behavioral analytics | Data consistency, timeliness |
| ‘C’ Level | Reports that drive top management decision making are not in sync with the actual operational data; difficulty getting a 360° view of the enterprise | Data consistency |
| Cross Functional | Sales and financial reports are not in sync with each other, typically because of data silos | Data consistency, auditability |
| Procurement | The procurement level of commodities differs from the production requirement, resulting in excess/insufficient inventory | Data consistency, accuracy |
| Sales Channel | The same product is represented differently across ecommerce sites, kiosks and stores, and the product names/codes in these channels differ from those in the warehouse system. This results in delays or wrong items being shipped to the customer | Data consistency, accuracy |

*Just a perspective, there could be other dimensions causing these issues too

As is evident, data is not just an IT issue but a business issue too, and it requires a ‘Collaborative Data Management’ approach (including business and IT) to ensure quality data. The solution is multifold, spanning the planning, execution and sustaining of a data quality strategy. Aspects such as data profiling, MDM and data governance are vital safeguards that help to analyze data, get first-hand information on its quality and maintain that quality on an ongoing basis.

Collaborative Data Management – Approach

Key steps in Collaborative Data Management would be to:

  • Define and measure metrics for data with business team
  • Assess existing data for the metrics – carry out a profiling exercise with IT team
  • Implement data quality measures as a joint team
  • Enforce a data quality firewall (MDM) as a governance process, so that only correct data enters the information ecosystem
  • Institute Data Governance and Stewardship programs to make data quality a routine and stable practice at a strategic level

This approach helps keep the data ecosystem within a company clean, as it involves business and IT users from each department at every level of the hierarchy.
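To make the ‘data quality firewall’ step a little more concrete, here is a minimal sketch of a gate that checks incoming records against rules agreed by the joint team. The rule names, the fields and the quarantine idea are hypothetical choices for this example; a real MDM platform does far more (matching, merging, survivorship and so on).

```python
# Hypothetical rules agreed between business and IT; each returns True when
# the record passes the check.
RULES = {
    "email_present": lambda r: bool(r.get("email")),
    "status_known": lambda r: r.get("status") in {"active", "inactive"},
}


def validate(record, rules):
    """Return the names of the rules that `record` violates."""
    return [name for name, check in rules.items() if not check(record)]


def gate(records, rules):
    """Split records into accepted ones and quarantined (record, failures) pairs."""
    accepted, quarantined = [], []
    for record in records:
        failures = validate(record, rules)
        if failures:
            quarantined.append((record, failures))
        else:
            accepted.append(record)
    return accepted, quarantined


if __name__ == "__main__":
    incoming = [
        {"id": 1, "email": "a@example.com", "status": "active"},
        {"id": 2, "email": "", "status": "active"},
        {"id": 3, "email": "c@example.com", "status": "unknown"},
    ]
    ok, held = gate(incoming, RULES)
    print(len(ok), "accepted;", len(held), "quarantined")
    for record, failures in held:
        print(record["id"], failures)
```

Quarantined records go back to a data steward rather than being silently dropped, which keeps the business side involved in the fix.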

Thanks for reading, would appreciate your thoughts.

 


 

Friday, 24 August 2012

Emerging DB Technology – Columnar Database


Today’s Top Data-Management Challenge:

Businesses today are challenged by the ongoing explosion of data. Gartner is predicting data growth will exceed 650% over the next five years. Organizations capture, track, analyze and store everything from mass quantities of transactional, online and mobile data, to growing amounts of machine-generated data. In fact, machine-generated data, including sources ranging from web, telecom network and call-detail records, to data from online gaming, social networks, sensors, computer logs, satellites, financial transaction feeds and more, represents the fastest-growing category of Big Data. High volume web sites can generate billions of data entries every month.

As volumes expand into the tens of terabytes and even the petabyte range, IT departments are being pushed by end users to provide enhanced analytics and reporting against these ever-increasing volumes of data. Managers need to be able to quickly understand this information, but, all too often, extracting useful intelligence can be like finding the proverbial ‘needle in the haystack’.

How do columnar databases work?

The defining concept of a column-store is that the values of a table are stored contiguously by column. Thus the classic supplier table from the suppliers-and-parts database would be stored on disk or in memory something like:

S1 S2 S3 S4 S5 20 10 30 20 30 London Paris Paris London Athens Smith Jones Blake Clark Adams

This is in contrast to a traditional row-store, which would store the data more like this:

S1 20 London Smith S2 10 Paris Jones S3 30 Paris Blake S4 20 London Clark S5 30 Athens Adams

From this simple concept flows all of the fundamental differences in performance, for better or worse, between a column-store and a row-store. For example, a column-store will excel at doing aggregations like totals and averages, but inserting a single row can be expensive, while the inverse holds true for row-stores. This should be apparent from the two layouts above.
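The two layouts can be mimicked with plain Python lists. This is only a toy model to show the access pattern, not how any real database engine is implemented; the function names are made up for the example.

```python
# The supplier table from the text: (id, status, city, name).
suppliers = [
    ("S1", 20, "London", "Smith"),
    ("S2", 10, "Paris", "Jones"),
    ("S3", 30, "Paris", "Blake"),
    ("S4", 20, "London", "Clark"),
    ("S5", 30, "Athens", "Adams"),
]

# Row-store: one flat sequence, record after record.
row_store = [value for row in suppliers for value in row]

# Column-store: the values of each column sit together.
column_store = [list(column) for column in zip(*suppliers)]


def average_status_row_store(flat, width=4, status_pos=1):
    """Pick every `width`-th value; on disk the other fields would be read too."""
    values = flat[status_pos::width]
    return sum(values) / len(values)


def average_status_column_store(columns, status_pos=1):
    """The status values are already contiguous, so only one column is touched."""
    values = columns[status_pos]
    return sum(values) / len(values)


def insert_row_column_store(columns, row):
    """A single-row insert has to touch every column separately."""
    for column, value in zip(columns, row):
        column.append(value)


if __name__ == "__main__":
    print(row_store[:8])
    print(column_store[1])
    print(average_status_row_store(row_store), average_status_column_store(column_store))
```

The aggregation reads one list in the column layout, while the insert writes to four; that asymmetry is the trade-off described above.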

The Ubiquity of Thinking in Rows:

Organizing data in rows has been the standard approach for so long that it can seem like the only way to do it. An address list, a customer roster, and inventory information—you can just envision the neat row of fields and data going from left to right on your screen.

Databases such as Oracle, MS SQL Server, DB2 and MySQL are the best-known row-based databases. Row-based databases are ubiquitous because so many of our most important business systems are transactional.

Data set example: the data set below contains 20 columns × 50 million rows.

[Figure: Example Data Set]

Row-oriented databases are well suited for transactional environments, such as a call center where a customer’s entire record is required when their profile is retrieved and/or when fields are frequently updated.

Other examples include:
• Mail merging and customized emails
• Inventory transactions
• Billing and invoicing

Where row-based databases run into trouble is when they are used to handle analytic loads against large volumes of data, especially when user queries are dynamic and ad hoc.

To see why, let’s look at a database of sales transactions with 50 days of data and 1 million rows per day. Each row has 30 columns of data, so this database has 30 columns and 50 million rows. Say you want to see how many toasters were sold in the third week of this period. A row-based database would return 7 million rows (1 million for each day of the third week) with 30 columns for each row, or 210 million data elements. That’s a lot of data elements to crunch to find out how many toasters were sold that week. As the data set increases in size, disk I/O becomes a substantial limiting factor, since a row-oriented design forces the database to retrieve all column data for any query.
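The arithmetic in this example is easy to check. The snippet below simply restates the figures from the paragraph above; the variable names are just for illustration.

```python
ROWS_PER_DAY = 1_000_000  # rows added each day
DAYS_STORED = 50          # days of data in the table
COLUMNS = 30              # columns per row
DAYS_QUERIED = 7          # the third week

total_rows = ROWS_PER_DAY * DAYS_STORED
rows_returned = ROWS_PER_DAY * DAYS_QUERIED

# A row-store hands back every column of every qualifying row.
elements_row_store = rows_returned * COLUMNS

if __name__ == "__main__":
    print(f"table size:    {total_rows:,} rows")
    print(f"rows returned: {rows_returned:,}")
    print(f"data elements: {elements_row_store:,}")
```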

Many companies try to solve this I/O problem by creating indices to optimize queries. This may work for routine reports (e.g. you always want to know how many toasters you sold in the third week of a reporting period), but there is a point of diminishing returns: load speed degrades because indices have to be updated or rebuilt as data is added. In addition, users are severely limited in their ability to quickly run ad hoc queries (e.g. how many toasters did we sell through our first Groupon offer? Should we do it again?) that can’t depend on indices to optimize results.


Pivoting Your Perspective: Columnar Technology

Column-oriented databases allow data to be stored column-by-column rather than row-by-row. This simple pivot in perspective—looking down rather than looking across—has profound implications for analytic speed. Column-oriented databases are better suited for analytics where, unlike transactions, only portions of each record are required. By grouping the data together this way, the database only needs to retrieve columns that are relevant to the query, greatly reducing the overall I/O.

Returning to the example in the section above, we see that a columnar database would not only eliminate 43 days of data, it would also eliminate 28 columns of data. Returning only the columns for product (toasters) and units sold, the columnar database would return only 14 million data elements (7 million rows × 2 columns), or 93% less data. By returning so much less data, columnar databases are much faster than row-based databases when analyzing large data sets. In addition, some columnar databases (such as Infobright®) compress data at high rates because each column stores a single data type (as opposed to rows, which typically contain several data types), which allows compression to be optimized for each particular data type. Row-based storage mixes multiple data types and a wide range of values within each row, making compression less efficient overall.
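Here is a quick check of the 93% figure, followed by a small illustration of why a single-type column compresses well. Run-length encoding is used only as a simple stand-in; the text above does not say which algorithms Infobright actually uses, and the sample values are made up.

```python
from itertools import groupby

ROWS_RETURNED = 7_000_000  # one week at 1 million rows per day
ALL_COLUMNS = 30
NEEDED_COLUMNS = 2         # product and units sold

elements_row_store = ROWS_RETURNED * ALL_COLUMNS
elements_column_store = ROWS_RETURNED * NEEDED_COLUMNS
reduction = 1 - elements_column_store / elements_row_store


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# A column holds one kind of value, so repeats sit next to each other...
product_column = ["kettle"] * 3 + ["toaster"] * 4
# ...while a row interleaves unrelated types and values.
row_values = ["toaster", 2, "London", "kettle", 1, "Paris"]

if __name__ == "__main__":
    print(f"{elements_column_store:,} elements, {reduction:.0%} less data")
    print(run_length_encode(product_column))
    print(len(run_length_encode(row_values)), "runs for", len(row_values), "row values")
```

Seven column values collapse to two pairs, while the interleaved row values do not shrink at all.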
