
Episode 1:4 - History of Data Architecture
08/08/19 • 26 min
Organizations are looking to their vast data stores for nuggets of information that give them a leg up on their competition. “Big Data Analytics” and Artificial Intelligence are the technologies promising to find those gold nuggets. Mining data is accomplished through a “Distributed Data Lake Architecture” that enables cleansing, linking, and analytics of varied distributed data sources.
Ad Hoc Data Management
- Data is generated in several locations in your organization.
- IoT (edge devices) has increased both the number of locations where data is generated and the variety of data types.
- Organizations typically look at data sources individually, in an application-centric way.
- Data Scientists look at a single data source and create value from it, one application at a time.
- Understanding the data and normalizing it is key to making this successful (a zip code is a zip code, a phone number comes in multiple formats, and longitude/latitude need a consistent representation); see the sketch after this list.
- The overhead of creating a data-centric application is high if you start from scratch.
- People begin building repeatable applications to get the benefits of reuse.
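As a concrete illustration of the normalization point above, here is a minimal sketch (not from the episode; the class and method names are hypothetical) that reduces phone numbers and ZIP codes to a single canonical form:

```java
import java.util.Optional;

// Illustrative normalizers for two of the fields mentioned above.
// Hypothetical helpers, not part of any framework discussed in the episode.
public final class FieldNormalizer {

    // Reduce any US phone format ("(555) 123-4567", "555.123.4567", "+1 555 123 4567")
    // to a canonical 10-digit string.
    public static Optional<String> normalizePhone(String raw) {
        String digits = raw.replaceAll("\\D", "");   // strip everything but digits
        if (digits.length() == 11 && digits.startsWith("1")) {
            digits = digits.substring(1);            // drop US country code
        }
        return digits.length() == 10 ? Optional.of(digits) : Optional.empty();
    }

    // Keep only the 5-digit ZIP, dropping the optional +4 suffix.
    public static Optional<String> normalizeZip(String raw) {
        String digits = raw.replaceAll("\\D", "");
        return digits.length() >= 5 ? Optional.of(digits.substring(0, 5)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(normalizePhone("+1 (555) 123-4567").orElse("invalid")); // 5551234567
        System.out.println(normalizeZip("02139-4307").orElse("invalid"));          // 02139
    }
}
```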
Data Warehouse Architecture
- Data Warehouse architecture takes repeatable processes to the next level. Data is cleansed, linked, and normalized against a standard schema.
- Data is cleansed once and used several times by different applications.
- Benefits include:
- Decreased time to answer
- Increased reusability
- Decreased capital cost
- Increased reliability
Data Lake Architecture
- A Data Lake moves all of the data into one place and stores it in its raw format.
- Data Lake architecture uses metadata to better understand the data.
- Allows for late binding of applications to the data (see the sketch after this list).
- Gives the ability to use and view the data in different ways for different applications.
- Benefits include:
- Ability to reuse data for more than one purpose
- Decreased time to create new applications
- Increased reusability of data
- Improved data governance (security and compliance)
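To make the late-binding idea concrete, here is a small illustrative sketch (our assumptions, not code from the episode): raw records land in the lake unchanged, and each application imposes its own schema only at read time.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: raw records are stored untouched; each application
// binds its own view (field positions and names) at read time.
public class LateBindingSketch {

    // Raw, unmodified records as they arrived (no schema enforced on write).
    static final List<String> RAW = List.of(
            "2019-08-08,thermostat-17,72.5,OK",
            "2019-08-08,thermostat-21,68.1,FAULT");

    // Application A: only cares about device health.
    static Map<String, String> healthView(String raw) {
        String[] f = raw.split(",");
        return Map.of("device", f[1], "status", f[3]);
    }

    // Application B: only cares about temperature readings.
    static Map<String, String> temperatureView(String raw) {
        String[] f = raw.split(",");
        return Map.of("date", f[0], "device", f[1], "tempF", f[2]);
    }

    public static void main(String[] args) {
        RAW.forEach(r -> System.out.println(healthView(r)));      // view 1
        RAW.forEach(r -> System.out.println(temperatureView(r))); // view 2
    }
}
```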
Distributed Data Lake (Data Mesh)
- One of the biggest problems with Data Lake and Data Warehouse is the movement of data.
- As data volume goes up, so does its gravity. It becomes harder to move.
- Regulations can limit where the data can actually reside.
- Edge devices need to encrypt and manage their data locally before pushing it to a data lake.
- This architecture allows data services (storage, encryption, cleansing, linking, etc.) to be pushed to compute elements at the edge.
- Data is only moved to a centralized location based on policy and data governance.
- Applications do not need to know where the data is located; the architecture handles that for them (see the sketch after this list).
- Benefits include:
- Decreased time to answer
- Late binding of data at runtime
- Multiple applications running on the same data on the same devices
- Decreased cost due to less data movement
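One way to picture the location transparency described above is an interface that applications code against while the mesh decides where a query actually runs. Everything below, including the DataMesh interface and the data-product name, is a hypothetical sketch rather than an API from the episode.

```java
import java.util.List;

// Hypothetical: the application asks for data by name and predicate; the mesh decides
// whether the query runs at the edge, in a regional store, or in the central lake.
interface DataMesh {
    List<String> query(String dataProduct, String filterExpression);
}

public class MeshSketch {
    // Trivial in-memory stand-in so the sketch runs; a real mesh would route the
    // query to wherever the data product physically lives.
    static final DataMesh MESH = (product, filter) -> List.of("thermostat-21 FAULT");

    public static void main(String[] args) {
        // The application names the data product and the filter; it never names
        // a host, region, or storage tier. Placement and governance are the mesh's job.
        System.out.println(MESH.query("thermostat-telemetry", "status = 'FAULT'"));
    }
}
```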
Rise of the Stack Developer (ROSD) - DWP
Previous Episode

Podcast 1:3 - Decreasing Ingestion Congestion with Optane DCPMM
Big Data analytics needs data in order to be valuable. Collecting data from different machines, IoT devices, or sensors is the first step toward deriving value from data. Ingesting data with Kafka is a common way to collect this data. Find out how to use Intel's Optane DC Persistent Memory to decrease ingestion congestion and increase the total throughput of your ingestion solution.
Kafka in real examples
- Ingestion is the first step in getting data into your data lake or data warehouse.
- Kafka is basically a highly available, distributed publish/subscribe hub.
- Data from a producer is published on Kafka topics, which consumers subscribe to. Topics give the ability to segment the data for ingestion.
- Topics can be spread over a set of servers on different physical devices to increase reliability and throughput.
- Performance Best Practices (see the producer-configuration sketch after this list)
- The producers' buffer size should be a multiple of the message size.
- Batch size can be tuned based on the message size for optimal performance.
- Spread partition logs across multiple drives, or place them on fast drives.
- Example Configuration (LinkedIn)
- 13 million messages per second (2.75 GB/s)
- 1100 Kafka brokers organized into more than 60 clusters.
- Automotive Space
- One Customer has 100 Million Cars - 240KB/min/car
- 1.6 Million Messages/sec
- 800 GB/s
- Approximate size of the installation
- 4400 Brokers, over 240 Clusters.
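The best practices listed above map onto standard Kafka producer settings. Below is a hedged sketch using the stock Kafka Java client; the broker address, topic name, and specific sizes are illustrative assumptions, not values from the episode.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustrative producer tuning along the lines of the best practices above.
public class TunedProducer {
    public static void main(String[] args) {
        int messageSize = 4096; // assumed average message size in bytes

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Buffer sized as a multiple of the message size (here: 8192 messages' worth).
        props.put("buffer.memory", Long.toString((long) messageSize * 8192));
        // Batch size matched to the message size (here: 64 messages per batch).
        props.put("batch.size", Integer.toString(messageSize * 64));
        // Allow a small linger so batches have time to fill.
        props.put("linger.ms", "5");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("ingest-topic", "car-42", new byte[messageSize]));
        }
    }
}
```

Spreading partition logs across drives is a broker-side setting (log.dirs in server.properties) rather than a producer option.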
Optane DC Persistent Memory
- Ability to use Optane technology in a DDR4 DIMM form factor.
- 128 GB, 256 GB, and 512 GB PMMs are available, up to 3 TB per socket.
- Two modes of operation: App Direct Mode and Memory Mode.
- Memory Mode gives the ability to have large system memory at a fraction of typical DDR4 cost.
- App Direct Mode means a program can write directly to the memory and the data is persistent; it survives reboots and power loss.
- App Direct Mode can also be used to create ultra-fast filesystems on the persistent-memory devices.
- DCPMM is used alongside DDR4 memory in a mixed configuration, for example a 16 GB DIMM paired with a 128 GB PMM, or a 64 GB DIMM paired with a 512 GB PMM.
- Memory modes can be controlled in the BIOS or from the Linux kernel.
- ipmctl - utility for configuring and managing Intel Optane DC persistent memory modules (PMMs).
- ndctl - utility for managing the non-volatile memory device (libnvdimm) subsystem in the Linux kernel.
Improving Ingestion using Persistent Memory
- Use the larger memory footprint to run more Kafka brokers on the same machine, each with a larger heap.
- Change Kafka to write directly to persistent memory.
- Create a persistent-memory filesystem and point the Kafka logs at the new filesystem (see the sketch after this list).
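For the filesystem option above, the usual approach is an fsdax namespace (created with ndctl) formatted and mounted with the dax option, after which the broker change is simply pointing log.dirs at a directory on that mount. The sketch below illustrates the "write directly to persistent memory" idea in plain Java by memory-mapping a file on such a DAX mount; the /mnt/pmem path is an assumption, and this is not the Kafka modification discussed in the episode.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: memory-map a file on a DAX-mounted persistent-memory filesystem
// (the /mnt/pmem path is an assumption) and write to it as if it were memory.
public class PmemWriteSketch {
    public static void main(String[] args) throws IOException {
        Path pmemFile = Path.of("/mnt/pmem/ingest-buffer"); // assumed fsdax mount point

        try (FileChannel ch = FileChannel.open(pmemFile,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put("message-0001".getBytes(StandardCharsets.UTF_8)); // writes land in pmem
            buf.force(); // flush so the data survives reboot or power loss
        }
    }
}
```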
Testing Results
- Isolate performance variability by limiting the testing to one broker on the same machine as the producer.
- Removes network variability and the network bottleneck.
- Decreases inter-broker communication and replica bottlenecks.
- Only the producer is run to find the maximum rate that can be ingested.
- Only consumers are run to find the maximum rate that can be egressed.
- Mixed producers and consumers are run to find pass-through rates.
- First approach: 50% persistent memory in App Direct Mode
- 3x performance over log files mounted on a SATA drive
- 2x performance over Optane NVMe drives
- Second approach: 100% persistent memory in App Direct Mode
- 10x performance over log files mounted on a SATA drive
- Approximately 2 GB/s, compared to roughly 150 MB/s for the SATA drive
- Additional testing has been performed with a cluster to increase total throughput; we found we were limited not by the drive speed, which is normally the case, but by the network speed (a 10-gigabit network).
Next Episode

Episode 1:5 - Information Management Maturity Model
Developing a Data Strategy can be difficult, especially if you don’t know where your current organization is and where it wants to go. The Information Management Maturity Model helps CDOs and CIOs find out where they currently are in their Information Management journey and their trajectory. This map helps guide organizations as they continuously improve and progress to the ultimate data organization that allows them to derive maximum business value from their data.
The model can be seen as a series of phases, starting from least mature to most mature: Standardized, Managed, Governed, Optimized, and Innovation. Many times an organization exists in multiple phases at the same time. Look for where the majority of your organization operates, and then identify your trailblazers, who should be further along in maturity. Use your trailblazers to pilot or prototype new processes, technologies, or organizational structures.
Standardized Phase
The Standardized phase has three sub-phases: Basic, Centralized, and Simplified. Most organizations find themselves somewhere in this phase of maturity. Look at the behaviors, technology, and processes that you see in your organization to find where you fit.
Basic
Almost every organization fits into this phase, at least partially. Here data is only used reactively and in an ad hoc manner. Additionally, almost all the data collected is stored based on predetermined time frames (often “forever”). Companies in the Basic phase do not erase data for fear of missing out on some critical information in the future. Attributes that best describe this phase are:
- Management by Reaction
- Uncatalogued Data
- Store Everything Everywhere
Centralized (Data Collection Centralized)
As organizations begin to evaluate data strategy, they first look at centralizing their storage into large Big Data storage solutions. This approach takes the form of cloud storage or on-prem big data appliances. Once the data is collected in a centralized location, Data Warehouse technology can be used to enable basic business analytics for the derivation of actionable information. Most of the time this data is used to fix problems with customers, the supply chain, product development, or any other area in your organization where data is generated and collected. The attributes that best describe this phase are:
- Management by Reaction
- Basic Data Collection
- Data Warehouses
- Big Data Storage
Simplified
As the number of data sources increases in an organization, companies begin to form groups that focus on data strategy, organization, and governance. This shift begins with a Chief Data Officer's (CDO) office. There are several debates on where the CDO fits in the company, under the CEO or the CIO. Don't get hung up on where they sit in the organization; the important thing is to establish a data organization focus and implement a plan for data normalization. Normalization gives the ability to correlate different data sources to gather new insight into what is going on across your entire company. Note that without normalization, the data remains siloed and only partially accessible. Another key attribute of this phase is the need to develop a plan to handle the sheer, massive volume of data being collected. Because of the increasing volume and cost of storing this data, tiered storage becomes important. Note that in the early stages it is almost impossible to know the optimal way to manage data storage. We recommend using the best information available to develop rational data storage plans, but with the caveat that these will need to be reviewed and improved once the data is being used. The attributes that best describe this phase are:
- Predictive Data Management (Beginning of a Data-Centric Organization)
- Data Normalization
- Centralized Tiered Storage
Managed (Standard Data Profiles)
At this phase organizations have formalized their data organization and segmented its roles: Data Scientists, Data Stewards, and Data Engineers are now on the team with defined responsibilities. Metadata management becomes a key factor in success at this phase, and multiple applications can now take advantage of the data in the company. A move from a Data Warehouse to a Data Lake has taken place to allow for more agility in developing data-centric applications. Data storage has been virtualized to allow for a more efficient and dynamic storage solution. Data analytics can now run on data sets from multiple sources and departments in the company. These attributes best describe this phase:
- Organized Data Management (Data Org in place with key roles identified)
- Meta-Data Management
- Data Lineage …