
SAP PRESS E-Bites

Introducing SAP Data Hub®

Michael Eacrett, Swapan Saha, Gaëtan Saulnier


SAP PRESS E-Bites

SAP PRESS E-Bites provide you with a high-quality response to your specific project need. If you're looking for detailed instructions on a specific task; or if you need to become familiar with a small, but crucial sub-component of an SAP product; or if you want to understand all the hype around product xyz: SAP PRESS E-Bites have you covered. Authored by the top professionals in the SAP universe, E-Bites provide the excellence you know from SAP PRESS, in a digestible electronic format, delivered (and consumed) in a fraction of the time!

Manu Kohli, Introducing Machine Learning with SAP Leonardo
www.sap-press.com/4710 | $24.99 | 114 pages

Adeel Hashmi, Implementing Machine Learning with SAP HANA
www.sap-press.com/4861 | $29.99 | 141 pages

Hanck, Mallory, Médaille, Data Provisioning and Cleansing with SAP HANA SDI and SAP HANA SDQ
www.sap-press.com/4111 | $24.99 | 97 pages

The Authors of this E-Bite

Michael Eacrett is the global VP of product management for SAP Data and EIM products and technologies. Previously, Mike was VP of product management for SAP HANA.

Swapan Saha is a senior director for big data product management at SAP focusing on SAP Data Hub. From 2014 to 2016, he managed SAP's global team for EIM product management.

Gaëtan Saulnier is a product manager for SAP Data Hub and SAP Agile Data Preparation. He also worked in product management for the SAP Advanced Analytics team.

Learn more about Michael, Swapan, and Gaëtan at www.sap-press.com/4723.

What You'll Learn

Discover how SAP Data Hub uses orchestration and workflows, data pipelines, and governance to connect and manage your data. Then, explore how it integrates with your existing tools, like SAP Data Services, SAP BW, and more. In-depth case studies will show you what challenges SAP Data Hub solves and what its implementation can look like. Integrate, process, and govern your data from a single interface!

Contents

1 Why SAP Data Hub?
1.1 Challenges in Data Management
1.2 SAP Data Hub

2 Use Cases
2.1 Internet of Things Data Ingestion and Orchestration
2.2 Data Science and Machine Learning Data Management
2.3 Intelligent Data Warehouse

3 Architecture and Integration
3.1 Product Architecture
3.2 Deployment Options
3.3 Product Integration

4 Data Management Capabilities
4.1 Pipeline
4.2 Data Orchestration and Workflows
4.3 Governance
4.4 End-to-End Process

5 Outlook and Road Map
5.1 Deployment
5.2 Capabilities
5.3 Integration with SAP Data Intelligence

6 What's Next?


1 Why SAP Data Hub?

This section introduces current business scenarios in the corporate data landscape, the challenges emerging from its growing complexity, and the silos that persist across data assets, in light of recent advances in technology for managing distributed data. Even with a chaotic and hard-to-manage data landscape, enterprises are in the middle of a digital transformation journey in which data is the core. This section also presents how SAP Data Hub, part of SAP's digital platform, reimagines this chaotic landscape by unifying data silos to offer a complete solution for data management in a distributed landscape with data of all shapes.

1.1 Challenges in Data Management

Corporate data landscapes are becoming increasingly diverse and distributed. Data volume is exploding with unstructured data, data from Internet of Things (IoT) deployments, and data from various social sites. Moreover, data is now stored in multiple locations and in different flavors: on-premise, in the cloud, in data warehouses, in data marts, and on edge devices. In the meantime, there is an increasing need to leverage existing data sources and stores and to combine them with these new data assets to support more advanced, contextual, and automated decision-making. Uncontrolled data consumption and proliferation, insufficient security, and an often-lacking governance across the distributed data landscape make it difficult for companies to leverage and control their data to drive the right decisions and to explore new business opportunities. Companies are also hindered in their ability to respond to continuously changing market conditions by sharing intelligent data and insights. Furthermore, existing systems and processing methods, which were built primarily for managing structured transactional data, are typically point to point, highly manual, or siloed. As a result, enterprises are collecting a treasure trove of data, but they can't unlock it to access the information needed to drive their next decade of market opportunities and profitability.


Combining data from SAP and non-SAP landscapes for advanced business intelligence (BI), machine learning (ML), and IoT use cases can be challenging because there are typically missing links between enterprise data (e.g., SAP applications) and big data (e.g., IoT sensor or social media data). Therefore, it isn't easy to operationalize data science processes in everyday business operations. Three of the biggest challenges in the data landscape can be summarized as follows:

- Increasing complexities: Figure 1.1 provides a visual representation of the increasing landscape complexity problem. Point-to-point movement of different kinds of data stored in different places (data warehouses, data marts, and cloud stores), as well as manual processes to move and access data, result in an inability for the right people (e.g., digital consumers, data scientists, and business analysts) to access the right data at the right time without relying on other technical units to deliver the data to them. This hampers the many companies endeavoring to improve customer engagement, optimize business processes, or build new digital services. The complexity doesn't stop at the storage level or at the data movement level but also includes the overall human cost of maintaining these landscapes. Many different technologies are involved, increasing the overall data challenge with use cases that can't be implemented using standard solutions. In addition, multiple duplicated processes using different historic solutions are often relied on, which adds costs to train and maintain the knowledge for those implementing and delivering the different use cases. Finally, many emerging technologies aren't yet standard, resulting in frequent changes of the technology used to store, process, and move the data based on the business need, while the value of the data remains underused.


Figure 1.1 Challenges in Data Management Due to Landscape Complexities

- Data silos: A link is missing between enterprise data (e.g., SAP and non-SAP applications) and big data (e.g., IoT or social media data). Data is kept in silos in various places across the enterprise, and these data silos are reinforced by organizational silos. For example, big data teams managing the enterprise data warehouse (EDW) may not use the same tools, practices, or policies for their data lake. User groups can't access and work with data according to their needs across organizational boundaries. Our survey shows that 74% of enterprises have such complex data landscapes that their agility to manage data is limited. Big data technologies often lack the readiness needed for enterprise deployments: they don't come with holistic lifecycle management or security concepts, such as security compliance, audits, rules, or privileges. It isn't easy to create a data pipeline connecting data assets managed by business silos across the various technologies involved in enterprise and big data. Furthermore, all the various underlying technologies, with their lack of standards, standardization, and enterprise readiness, often imply the use of additional solutions, which further complicates the existing data challenges. For this


reason, the data integration effort needed to implement the use cases prevents the creation of new business value.

- Distributed systems and new technologies: With the adoption of new data storage technologies (e.g., Hadoop or object stores) and requirements to support BI or SQL layers, additions such as Spark, ML, and advanced processing often add complexity in managing data. At the same time, the emergence of new distributed technologies for containers and container management offers a new opportunity to address data landscape complexities.

1.2 SAP Data Hub

SAP Data Hub enables the intelligent enterprise by providing agile management and processing of data of all shapes and formats residing in a diverse and distributed landscape across both cloud and on-premise business silos. It provides enterprise-wide data governance, orchestration, and pipelining to foster the development of data-driven applications, such as IoT applications, and to reimagine your data landscape, as shown in Figure 1.2.

Figure 1.2 Reimagining Your Distributed Data Landscape with SAP Data Hub


SAP Data Hub provides the following capabilities:

- End-to-end enterprise application integration (SAP S/4HANA and SAP C/4HANA solutions, SAP SuccessFactors, SAP Ariba, SAP BW/4HANA, etc.)
- Self-learning and active data governance
- Integration with SAP HANA Data Management Suite components such as SAP HANA and SAP Cloud Platform services
- Use of ML/artificial intelligence (AI) and data science models, executing the models within SAP Data Hub
- Scalable deployment models for distributed data processing with the latest technological innovations in containers and container management
- Extended enterprise information management (EIM) capabilities integrated with SAP Enterprise Information Management (SAP EIM) solutions
- Open, big data-centric architecture with open source and third-party integration
- Support for multi-cloud deployments, with SAP Data Hub available as a service or via bring your own license (BYOL) for on-premise or hyperscaler deployment

Figure 1.3 shows how SAP Data Hub unifies data silos by connecting disparate data sources, including applications, data marts, data warehouses, databases, and cloud storage, for all data-driven applications, advanced analytics and BI, and automated processes with a series of data processing services. These services include data ingestion, data transformation, data enrichment, metadata management, data preparation, ML execution, and advanced algorithms.


Figure 1.3 SAP Data Hub Unifying Data Silos

SAP Data Hub provides a single solution for data discovery and sharing, pipelining, and governance across the landscape. Instead of the old model in which organizations tried to centralize all information in one spot, SAP Data Hub centralizes the governance, leaving the data where it is while offering process orchestration across the distributed landscape for all shapes of data, including structured, unstructured, and streaming data, big data, and cloud storage. As we'll explore more fully in Section 3, the three main functional areas of SAP Data Hub are as follows:

- Data governance: One of the applications of metadata management is data governance, which SAP Data Hub delivers. The vision of SAP Data Hub is to provide complete metadata management for all types of data, starting with discovery of the data set and publishing metadata to a catalog that users can search to gain insight from profiled data. The data can be worked with further via the data pipeline or via easy-to-use data preparation by business users. This metadata management provides a unified view, makes it easy to trace how data has been used and by whom, and helps you understand the impact of future changes throughout the data value chain.


- Data pipeline: The modeling tool uses a flow-based programming paradigm to create data processing pipelines that implement distributed "push-down" processing, executed quickly where the data resides and without centralizing data, so you can accelerate and scale data projects quickly. Data processing pipelines are modeled as computation graphs, which support data ingestion and transformation. In these graphs, nodes represent operations on the data, while edges represent the data flow (a conceptual sketch follows this list). Pipelines can be triggered by changes in data, so that your business is more responsive to opportunities and threats. The modeler also provides a runtime component to execute graphs in a containerized environment that runs on a Kubernetes-based SAP Data Hub runtime. The solution is an open architecture that helps customers manage data flows in modern, hybrid landscapes. SAP's goal is to help customers work with information across the data landscape, whether it resides in the cloud, on-premise, or a combination thereof, with data from SAP systems (e.g., SAP HANA) or non-SAP systems (e.g., cloud object storage [AWS S3], salesforce.com, and Hadoop). The solution also helps drive not just analytics but also applications and master data management.

- Data orchestration: SAP Data Hub offers orchestration of external and internal processes within a data pipeline. As an example of external orchestration, SAP Data Hub can trigger execution of a process chain on an SAP Business Warehouse (SAP BW) system. Data can be transferred on the fly from an SAP BW system into SAP Data Hub's internal SAP Vora tables. Data pipelines can execute remote SAP Data Services jobs and SAP HANA smart data integration flowgraphs. Additionally, Spark jobs and Hive queries can be submitted to Hadoop clusters with the orchestration functionality of SAP Data Hub. Internally, one pipeline can call another data pipeline.
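To make the flow-based paradigm concrete, the following minimal Python sketch (our illustration, not SAP Data Hub code) wires three independent computation units into a graph in which nodes operate on data and edges carry it downstream, mirroring the computation graphs described above.

# Minimal flow-based pipeline sketch (illustrative only, not SAP Data Hub APIs).
# Each function is an independent computation unit; the data handed from one
# unit to the next plays the role of the graph's edges.

def source():
    # Emits raw records (stand-in for a connector operator).
    for record in [{"id": 1, "temp": 71.3}, {"id": 2, "temp": 98.6}]:
        yield record

def transform(records):
    # Enriches each record (stand-in for a processing operator).
    for record in records:
        record["alert"] = record["temp"] > 90.0
        yield record

def sink(records):
    # Consumes the results (stand-in for a target connector).
    for record in records:
        print(record)

# Wiring the graph: source -> transform -> sink.
sink(transform(source()))

Listing 1.1 A Minimal Flow-Based Pipeline (Illustrative Sketch)

Because each unit only consumes and produces records, units can be rearranged, reused, or scaled independently, which is the property the containerized SAP Data Hub runtime exploits.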


2 Use Cases

SAP Data Hub is built to support a plethora of operations on data in all shapes and formats from traditional relational processing to running advanced algorithms or ML and more. This section highlights three of SAP Data Hub’s major use cases. For each use case category, we describe the challenges you’ll face without SAP Data Hub, explain how SAP Data Hub addresses those challenges, and then provide a quick look at specific scenarios. We’ll also walk through a specific real-world example of SAP Data Hub’s use for each of these categories.

2.1 Internet of Things Data Ingestion and Orchestration

SAP Data Hub tackles the challenges of integrating, analyzing, and processing vast quantities of raw data and events from disparate semi-structured sources with low-level semantics and no business context. Streams of data coming from IoT devices to be processed by IoT applications often need to be persisted in SAP HANA or in a data lake. SAP Data Hub helps ingest and orchestrate the data into multiple targets as depicted in Figure 2.1.

Figure 2.1 Integrating and Processing Disparate Data from Messaging Systems and High-Volume Cloud Stores
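To make this ingestion pattern concrete, the sketch below shows in plain Python the kind of work such a pipeline performs: subscribing to an MQTT stream of device readings and landing each one in an S3-based data lake. It is a simplified stand-in for SAP Data Hub's built-in MQTT and cloud-store operators, and the broker host, topic, and bucket names are placeholders.

# Illustrative ingestion sketch: MQTT device stream -> S3 data lake.
# SAP Data Hub ships ready-made operators for this; the code below only
# mirrors their behavior. Broker, topic, and bucket names are placeholders.
import json
import uuid

import boto3
import paho.mqtt.client as mqtt  # paho-mqtt 1.x style client

s3 = boto3.client("s3")
BUCKET = "iot-raw-data"  # hypothetical data lake bucket

def on_message(client, userdata, msg):
    # Persist each sensor reading as a JSON object in the data lake.
    reading = json.loads(msg.payload)
    key = f"appliances/{reading['device_id']}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(reading))

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.com", 1883)  # placeholder broker
client.subscribe("appliances/+/telemetry")
client.loop_forever()

Listing 2.1 Conceptual Sketch of an MQTT-to-Data-Lake Ingestion Step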

Additionally, SAP Data Hub enables you to solve the point-to-point challenge of distributed heterogeneous environments spanning messaging systems, cloud stores, SAP data management solutions, and enterprise applications.


Finally, its event-driven pipelines scale to execute a very large number of pipelines in parallel, enabling the mass processing and orchestration of many use cases (or variants of a single use case) at the same time while keeping a single point of entry to all your data needs. Let's now take a look at some of the challenges faced by organizations in this space and how SAP Data Hub addresses these challenges.

Current Challenges

Most companies can already take advantage of stored big data and data generated from smart devices, but they are unable to fully exploit the potential of such significant data assets or turn them into improved processes that can drive future profitability and productivity gains. These companies face a challenge in deriving actionable insights from this vast quantity of raw data that has low-level semantics and close to no business context. The lack of standard tools or technology to access all these heterogeneous data repositories and data streams increases the amount of complex manual work and demands specific knowledge to execute against and leverage this type of data. The inability to transfer or reuse technologies and knowledge investments to solve new business needs compounds the problem.

Solution

SAP Data Hub addresses these challenges by integrating and processing disparate data from messaging systems, high-volume cloud stores, and SAP data management solutions and enterprise applications in a single and unified way. It uses the same unique, event-based pipelining capabilities that are built to scale to execute thousands of concurrent processes in parallel at any time across highly distributed landscapes.


A few examples of successful customer scenarios in IoT data ingestion and orchestration include the following:

- Real-world performance information from Internet-enabled devices, such as appliances
- Logistics and supply chain optimization with on-the-fly, context-aware rerouting and replanning
- Digital twins to enable simulation of possible actions, based on ML, for individual devices or groups of devices or machines
- Smart manufacturing and predictive maintenance
- Warehouse and distribution center goods movement and picking optimization
- Goods replenishment and reordering efficiency
- Transportation, order consolidation, and returns optimization

These examples are typical of SAP Data Hub’s capability to easily collect terabytes of new sensor or machine data per day from millions of connected devices. The data is automatically processed using advanced ML techniques to easily identify the associated customer or product that the organization needs to act on. SAP Data Hub allows you to automatically merge this extracted information with your core company business data so it can be directly leveraged for immediate action to improve the quality of manufactured products, increase customer satisfaction/experience, integrate customer feedback in the design or improvement of manufactured products, or improve business services. SAP Data Hub unites data from messaging systems, cloud storage, and SAP’s data management solutions and enterprise applications, along with event-based pipelines that are scaled to execute many pipelines in parallel. These capabilities help you address data ingestion and building IoT applications in which users can easily refine the business value from data ingestion to enterprise applications via a simple and unique visual modeling environment. With this environment, users can access and manage governed data


to orchestrate, refine, and schedule automated data-driven processes that enable real-time action and decision-making.

Manufacturing Case Study

SAP Data Hub is being used in almost all industries to process data in different formats for IoT applications and ML scenarios. This particular example is from the manufacturing industry and involves appliance sensor data. A manufacturer of home appliances wants to gain more insight from IoT sensor data from their appliances, including dishwashers. A dishwasher often offers several wash cycles in its repertoire. Each cycle differs in speed, water temperature, pressure, and the number of washes and rinses it runs. The combined IoT data and enterprise master data can be used as a source for product quality analysis, maintenance activities, design changes, customer engagement, and so on. An enterprise-wide data platform for running, analyzing, and optimizing processes is another goal for the company. An enterprise silo holds the device master data, and an operational silo manages the sensor data. This setup can hinder the organization from connecting the two data sets. The existing architecture consists of a large stack of individual tools. Many manual steps and much coding are required to collect, process, and consume data from both silos. The existing solution lacks end-to-end monitoring, scheduling, and automation. The overall process can be visualized with three major blocks, as shown in Figure 2.2.

Figure 2.2 Solution Steps in Customer Insight IoT Ingestion and Processing


The end-to-end process includes data collection, data analysis, and data utilization. These three steps comprise the following activities:

- Data collection:
  – Collect sensor/machine data.
  – Collect customer behavior data.
- Data analysis:
  – Cleanse, enrich, and improve the quality of the sensor/machine data.
  – Analyze the sensor/machine data for pattern detection and knowledge acquisition.
  – Determine which customers use which wash cycle (program).
- Data utilization:
  – Receive customer-driven product insight.
  – Use the analyzed data to improve the products' design, functions, and production.
  – Give customers a direct influence on product quality.

Figure 2.3 shows the landscape before SAP Data Hub is used, which is built around diverse tools running standalone and often from the command line. These tools have no end-to-end automation for development and production, no lifecycle capabilities to manage models, no end-to-end monitoring of the production landscape, and no easy way to enforce access policies or to audit and report violations of those policies.

Figure 2.3 Solution before SAP Data Hub


With SAP Data Hub, the entire process of data collection, enhancement, analysis, and consumption is automated end to end, as depicted in Figure 2.4. The goals have been to create end-to-end management from data ingestion to visualization and analytics in SAP BW-based dashboards and reports, as well as to automate the overall management and orchestration of data processes, including Spark scripts and Python processes.

Figure 2.4 Solution with SAP Data Hub

Using SAP Data Hub helped with the ingestion of raw sensor data into AWS S3 alongside SAP BW data, supported data refinement through custom procedures, and enabled data access for data scientists, including access to SAP BW data. The value SAP Data Hub added included end-to-end data flow modeling, execution, monitoring, and management in a single environment; out-of-the-box integration with SAP BW process chains; full automation of the end-to-end process with data discovery; and a data catalog. The organization in this use case benefited from improved performance of predictive models and a significant reduction in manual steps. The company was also better able to thoroughly analyze customer behavior and gain valuable insights into product usage.


2.2 Data Science and Machine Learning Data Management

SAP Data Hub can access, discover, assess, prepare, harmonize, and integrate data from all sources to increase the effectiveness of artificial intelligence (AI) or ML algorithms. It integrates a variety of data sources with open data landscape management. SAP Data Hub helps to prepare data coming from SAP and non-SAP applications and then runs ML models on the prepared data to perform the end-to-end processes outlined in Figure 2.5.

Figure 2.5 AI/ML Production with SAP Data Hub

One of the most time-consuming activities in building ML models is the sourcing and preparation of data to be used in model definition and training. In some cases, this can consume 80% of the data scientists' time in a project. SAP Data Hub enables data scientists to select the right data for their project and reuse the existing metadata, thus reducing the data acquisition time significantly. Furthermore, data scientists can process ML models by leveraging many SAP and non-SAP engines within the same modeling tool to use and integrate any type of technology without having to learn new techniques. Instead, they can be fully productive right away. Finally, SAP Data Hub quickly and easily operationalizes ML modeling outcomes back into enterprise processes. Therefore, by dramatically decreasing the time needed to collect, refine, and orchestrate data to run AI/ML


algorithms end to end within the same tool, SAP Data Hub directly integrates with enterprise processes and facilitates collaboration between data scientists, data engineers, and data stewards. This reduces the time needed to move a model into a productive environment and leverage its results. The ML algorithms can be executed within the pipeline on data "in-flight" or within a processing engine external to the pipeline but orchestrated by it. Let's now take a look at some of the challenges faced by organizations in this space and how SAP Data Hub addresses these challenges.

Current Challenges

Enterprise customers encounter a series of challenges when running AI/ML in production landscapes with diverse data coming from various sources. The challenges include the following:

- Data science projects must be ported from the prototype stage to the production stage, and identifying and refining all the relevant data assets needs to be done in a reliable and cost-efficient way.
- Data scientists spend most of their time (over 85% in some cases) collecting, preparing, harmonizing, shaping, refining, cleaning, and organizing data instead of developing advanced ML algorithms. This can greatly impact the effectiveness of ML projects by extending their duration or reducing the number of models and simulations (or variants of models) that can be executed within the project timeline. Additionally, the subsequent operationalization of the ML solution can suffer due to the disconnect from existing data sources and flows.
- Point-to-point integration architectures across the variety of data sources required for data science projects often result in using multiple technologies, or in an increase in the time required to execute a successful project due to the need for multiple teams across the enterprise to coordinate, plan, provision, and deliver the data and metadata.
- ML algorithms need to be integrated into enterprise processes and democratized beyond the data science team.


Solution

SAP Data Hub addresses these challenges by running AI/ML in the production landscape via a series of capabilities, including out-of-the-box integration with different kinds of data sources (structured, unstructured, streaming, etc.) in a single streamlined tool across a distributed landscape. You can build ML models by leveraging many SAP and non-SAP ML engines within the same tool. SAP Data Hub quickly and safely operationalizes ML outcomes back into enterprise processes, reducing the time needed to leverage the value of the data and enabling automation of model updates. The time needed to collect, refine, and orchestrate data for successful adoption is dramatically reduced. You can implement SAP Data Hub AI/ML solutions in various industries:

- Insurance industry risk profiling
- Credit analysis and automated scoring models
- Machine failure prediction leading to automated preventative maintenance
- Financial and supply chain fraud detection
- Customer churn and propensity to rebuy or return
- Resource allocation or purchase optimization
- Real-time financial risk assessments on customers

SAP Data Hub connects to messaging systems such as Kafka, Message Queuing Telemetry Transport (MQTT), NATS, and Web Application Messaging Protocol (WAMP), as well as SAP Cloud Platform Integration, in addition to out-of-the-box connectivity to big data and native cloud stores, such as Azure Data Lake (ADL), Google Cloud Storage (GCS), Hadoop Distributed File System (HDFS), Amazon Simple Storage Service (Amazon S3), Windows Azure Blob Storage (WASB), and WebHDFS. In addition, it offers connectivity to SAP and non-SAP enterprise applications and databases, and the list is expanding rapidly. Additional data sources can be connected via SAP Data Services and base operators.


SAP Data Hub allows you to run SAP and non-SAP ML algorithms, including TensorFlow, SAP Leonardo Machine Learning Foundation, Spark ML, R, and Python, in an ideal target landscape that includes both a big data store and SAP HANA. SAP Data Hub also offers one unified solution to process ML and advanced analytics algorithms on any mix of engines, both SAP (SAP HANA Predictive Analysis Library [PAL], SAP Leonardo Machine Learning, etc.) and non-SAP (Python, R, Spark, TensorFlow, etc.). This solution handles data ingestion and preparation from any source of any type to solve point-to-point challenges and easily infuse ML and predictive analytics into any target business process.
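As a concrete illustration of scoring data "in-flight", the sketch below shows the body of a Python step that attaches a churn probability to each record passing through a pipeline. The tiny scikit-learn model trained inline stands in for one loaded from a model repository, and the field names are our assumptions, not SAP-shipped code.

# Sketch of in-flight ML scoring inside a pipeline step. The toy model
# trained here stands in for one consumed from a model repository; field
# names are hypothetical.
from sklearn.linear_model import LogisticRegression

# Toy training data: [orders in last 90 days, days since last order] -> churned?
X = [[12, 5], [1, 200], [8, 15], [0, 300]]
y = [0, 1, 0, 1]
model = LogisticRegression().fit(X, y)

def score(record):
    # Attach a churn probability to each record flowing through the pipeline.
    features = [[record["orders_last_90d"], record["days_since_last_order"]]]
    record["churn_probability"] = float(model.predict_proba(features)[0][1])
    return record

print(score({"orders_last_90d": 2, "days_since_last_order": 120}))

Listing 2.2 In-Flight Scoring Sketch with a Toy scikit-learn Model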

Chemicals Case Study

Our next use case is from a major chemical company that faces the challenge of frequent customer churn and of understanding customer behavior. The cause of the churn must be determined, followed by proactively reaching out to customers to offer promotions. This process should also help to identify repeat-buy opportunities. However, fragmented data silos, with sales orders in SAP HANA and customer purchasing history in SAP ERP, added complexity in accessing the necessary data. Reuse and modularization of ML processes have been very slow. The customer wants a centralized data management environment to manage and orchestrate data sources and execute ML algorithms, with flexible integration with ML libraries to use a variety of algorithms, and with reusable data flows for different scenarios. Figure 2.6 shows the solution architecture with SAP Data Hub, which offers cataloging of sales orders, integration of Python- and R-based ML algorithms, and end-to-end data management in one place. SAP Data Hub provides a high degree of automation, including deployment of scripts and algorithms; end-to-end data flow modeling; execution, monitoring, and management in a single environment; and out-of-the-box integration with SAP HANA for enterprise data integration.


Figure 2.6 Architecture with SAP Data Hub

The organization benefits from the prediction of customer behavior for both churn and repeat-buy scenarios, reusing pipelines for different scenarios by exchanging operators, as shown in Figure 2.7. SAP Data Hub helped the company understand why and when individual customers were going to churn. By providing timely churn prediction and addressing root causes, SAP Data Hub enabled the company to proactively reach out to customers and offer promotions to avoid churn. SAP Data Hub could load customers' data from SAP S/4HANA and leverage predictive models. It also offered version control and data science model lifecycles for execution in a production landscape. SAP Data Hub offers a scalable platform for the productization of ML scenarios with a high degree of modularity and extensibility. Pipelines can be reused for different scenarios by exchanging operators, data sets can be provided by different business units, and custom operators can be used to integrate existing code/libraries. A flow-based programming model provides visual


representation and documentation of data processes that can be orchestrated centrally, instead of having individual scripts scattered across the landscape.

Figure 2.7 Modular and Extensible Pipeline for Customer Churn Analysis with SAP Data Hub

2.3 Intelligent Data Warehouse

SAP Data Hub enables you to connect new data sources with previously siloed data from traditional data warehouses, data marts, enterprise applications, and big data stores. This gives you the opportunity to easily and rapidly integrate and leverage new data sources, as shown in Figure 2.8.


Figure 2.8 Rapidly Integrating and Leveraging New Data Sources in an Intelligent Data Warehouse

Furthermore, it combines all types of sources, including structured and unstructured data, and enables a large variety of processing on them so you can address any type of processing and integration need. SAP Data Hub also addresses the fact that the modern data landscape will be a combination of on-premise and multiple cloud solutions. This is also true for the modern intelligent data warehouse, highlighting the need to build data pipelines that span all layers of the landscape. Finally, SAP Data Hub seamlessly processes large data sets across highly distributed landscapes and close to the data source, moving only high-value data and avoiding unnecessary data movement and replication. This helps to reduce data quality errors and minimizes the overall technical effort to build trustworthy data repositories. Let's now take a look at some of the challenges faced by organizations in this space and how SAP Data Hub addresses these challenges.

Current Challenges

The major challenge in intelligent data warehousing is the missing link between big data and enterprise data. While enterprises have invested a lot to build, manage, update, and maintain enterprise master data, the big data


repository is disconnected from the data that fuels the enterprise's businesses, and the available insights cannot easily be integrated into the main decision-making process. Furthermore, the lack of enterprise readiness of big data solutions creates costly and inefficient data preparation workflows. Likewise, a lack of agility in supporting new business needs, limited tools, and the high effort required to productize complex data scenarios across data landscapes hinder enterprise customers from building intelligent data warehouses.

Solution

SAP Data Hub uses one unified tool to rapidly integrate and leverage new data sources, as well as previously siloed data from traditional data warehouses, data marts, enterprise applications, or big data stores. It processes bulk data close to where it resides, minimizing lift and shift, and seamlessly moves only high-value data across distributed landscapes when necessary. SAP Data Hub combines all types of sources, including structured, unstructured, and streaming data. It allows flexible data enrichment and data preparation, and it enables data discovery, semantic analysis, data pipeline acceleration, and data flow auditing. Following are a couple of examples of successful customer projects in building intelligent data warehouses:

- A merger ended successfully due to the ability to integrate data from two organizations quickly and efficiently: overcoming the challenges of overlapping and duplicated data, of the same information being structured in different ways with different technologies, and of multiple diverse processes leveraging the data differently for the same need, by harmonizing and reconciling the data structures and using business processes to leverage them properly and accurately.
- Actionable insights were created by combining data from big data stores, such as Hadoop, with data in structured and highly governed data warehouses to extend and improve the existing business processes as well as generate new insights.


SAP Data Hub not only helps to build the data pipeline graphically but also manages the execution of the pipeline with distributed data and data processing native to the data sources to allow federated push-down capabilities. This distributed processing allows companies to complete the execution of a pipeline quickly to deliver rapid business outcomes.

Retail Case Study

A retail company faced the challenge of how to use social media data collected from various sources with a business warehouse (BW) to build an intelligent warehouse. A traditional BW is built with transactional data from enterprise applications. Intelligent warehouses require data beyond applications, such as customer behavior data often available in social media, machine data collected by sensors, and big data. Combining application or enterprise data with big data will create an intelligent warehouse solution. The company needs to combine the refined big data with the enterprise data and corporate master data and then extract or federate the data into SAP HANA or SAP BW/4HANA for decisions based on both social media (big data) and enterprise data. The requirements are shown graphically in Figure 2.9.

Figure 2.9 End-to-End Data Processing Requirement from Big Data with Enterprise Data


Figure 2.10 shows that the data received from various social media applications is ingested into AWS S3 and is then processed further within HDFS for frequently required data processing, including parsing, anonymization, cleansing, lookups, and joins. The processed data is written into SAP Vora, which can be accessed within SAP HANA either by replication or by data federation via SAP HANA smart data access. In the end, the social media data is ingested, processed, and loaded into SAP HANA to be combined with enterprise data in an end-to-end data pipeline.

Figure 2.10 Integration of Social Media Data into a Business Warehouse with SAP Data Hub
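The refinement stage in Figure 2.10 (parse, anonymize, cleanse, look up, join) is exactly the kind of work a pipeline pushes down to Spark on the Hadoop cluster. The hedged PySpark sketch below illustrates that stage; the paths and column names are placeholders we made up for the example.

# Illustrative PySpark job for the refinement stage in Figure 2.10: parse raw
# social media JSON, anonymize user identifiers, and join in master data.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2

spark = SparkSession.builder.appName("social-media-refinement").getOrCreate()

posts = spark.read.json("hdfs:///landing/social/posts/")        # parse
lookup = spark.read.parquet("hdfs:///master/product_lookup/")   # master data

refined = (
    posts
    .dropna(subset=["user_id", "product_code"])                 # cleanse
    .withColumn("user_hash", sha2(col("user_id"), 256))         # anonymize
    .drop("user_id")
    .join(lookup, on="product_code", how="left")                # look up/join
)

# Write the refined result where SAP Vora (and, via federation, SAP HANA)
# can pick it up downstream.
refined.write.mode("overwrite").parquet("hdfs:///refined/social/posts/")

Listing 2.4 PySpark Sketch of the Refinement Stage in Figure 2.10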


3 Architecture and Integration

This section outlines the SAP Data Hub architecture and how SAP Data Hub is envisioned to be deployed everywhere: on-premise, in public clouds, and in private clouds. We'll discuss how SAP Data Hub is and will be integrated with SAP and non-SAP products to extend the functionality of installed products. Some products are already integrated with SAP Data Hub, and others are at different stages of integration.

3.1 Product Architecture

The fundamental architectural design of SAP Data Hub is based on serving current and future requirements for processing massive amounts of data of different types and quality levels, held in different storage systems, in a distributed data landscape for various use cases, as outlined in the previous section. Following are the drivers for the SAP Data Hub architecture:

- Use state-of-the-art technology and concepts in distributed computing.
- Leverage open infrastructure and integration capabilities.
- Offer a solid foundation for new data-driven applications.
- Leverage processing power where data resides.
- Harvest massive parallelization.
- Process masses of data at scale.

The SAP Data Hub 2.5 architecture (and the subsequent 2.x release) is presented in Figure 3.1. Container and container management tools, such as Docker and Kubernetes, are the technical foundation of the SAP Data Hub distributed runtime. Kubernetes is an open-source system to automate deployment of containerized applications and manage clusters to scale.


Figure 3.1 SAP Data Hub Architecture for the 2.x Release

All the SAP Data Hub components are containerized, which provides the following benefits:

- High scalability
- High resilience (self-healing)
- Different deployment options

The SAP Data Hub architecture diagram includes UI components at the top, runtime components in the middle, and a set of system management services and connectivity to SAP and non-SAP products. For end users, SAP Data Hub offers UIs for metadata management, including the catalog, self-service data preparation, data modeling, user and policy management, and a few other services. The three runtime components of SAP Data Hub are as follows:

- SAP Vora database
- Pipeline engine
- Application services


The SAP Vora database supports loading and indexing data from external stores (e.g., HDFS and S3). It partitions data based on user-defined partitioning functions and supports the evaluation of SQL-like queries on the loaded (and partitioned) data. The pipeline engine supports flow-based applications, as shown in Figure 3.2. Flow-based applications are graphs of connected nodes, which are operators. An operator can provide connectivity to a source and/or a target of data. In addition, an operator can perform processing operations of the relational variety, such as merging two data sets via joins or unions, running an advanced algorithm, or using machine learning (e.g., running R or Python code within an operator).

Figure 3.2 Flow-Based Applications in SAP Data Hub

Data to be processed flows through a network of operators (via interfaces), where an operator is an independent computation unit. SAP Data Hub provides hundreds of out-of-the-box operators for various connectivity methods and for processing data. SAP Data Hub also enables you to develop your own custom operators for connectivity to new sources and for running your code within the SAP Data Hub Kubernetes distributed runtime architecture (a sketch of such an operator follows). Containers constitute the operators' execution environments, and operators can be grouped together to ensure that they run in the same Kubernetes pod deployed on the same host. Groups can be annotated to run multiple containers for scalability.
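In practice, such a custom operator is often a short Python script. The sketch below follows the callback pattern of SAP Data Hub's Python scripting operators as we understand it: the runtime injects an api object, and the script registers a callback per input port (exact names and details may vary by release).

# Sketch of a custom Python operator body in the callback style of SAP Data
# Hub's scripting operators. The `api` object is injected by the pipeline
# runtime, and the port names ("input", "output") are whatever the operator
# declares; exact API details may vary by release.

def on_input(data):
    # `data` arrives from the upstream operator over the "input" port.
    data["temp_celsius"] = (data["temp"] - 32.0) * 5.0 / 9.0
    api.send("output", data)  # hand the converted record to the next operator

api.set_port_callback("input", on_input)

Listing 3.1 Callback-Style Sketch of a Custom Python Operator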


Application services are the third component of the SAP Data Hub distributed runtime. Services such as profiling, connectivity to external systems, and data cataloging are supported within SAP Data Hub application services. SAP Data Hub offers a set of system management services that includes multitenancy, user and access management, content lifecycle management, cluster management, and monitoring and diagnostics:

- Multitenancy: Supports handling and managing the different tenants within a cluster and preparing the cluster to be offered as a cloud service.
- User and access management: Provides general user and access management, as well as integration with existing authentication mechanisms.
- Content lifecycle management: Helps with versioning and transporting content across the development, test, and production landscape.
- Cluster management: Supports various persistency and storage devices.
- Monitoring and diagnostics: Built-in diagnostics with metric analytics and visualizations in Grafana and Kibana enhance the diagnostic experience for both users and administrators.

3.2 Deployment Options

Multitenant and fully containerized on a Kubernetes base, SAP Data Hub can be deployed anywhere a supported Kubernetes deployment is available: on-premise or in a private or public cloud. Common deployment options are as follows:


- Managed Kubernetes services of the major hyperscale clouds
- Private cloud and on-premise installations via certified partners
- SAP Data Hub as a service (in beta, as of May 2019) on SAP Cloud Platform

To deploy SAP Data Hub, you need a Kubernetes cluster on your preferred hardware system and operating system. The following deployment options are available:

- In the managed Kubernetes services of the major hyperscale clouds, your tasks are to pay for the infrastructure, deploy SAP Data Hub, and operate Kubernetes and SAP Data Hub.
- For on-premise installations within your data center, your tasks are to procure hardware; install the operating system, Kubernetes, and SAP Data Hub; and operate the infrastructure and software.
- For private clouds operated by SAP certified partners, your task is to select your preferred private cloud provider, depending on the required service level and operations offering. The preconfigured environment, including hardware, operating system, Kubernetes, and SAP Data Hub, in a private cloud or hybrid environment is provided by SAP certified partners.
- The latest offering is a fully managed cloud offering for SAP Data Hub on SAP Cloud Platform. The whole infrastructure is provided, and all operations of the Kubernetes cluster and SAP Data Hub are done for you. Your only task is to use the functionality and build amazing flow-based applications.

The list of supported hyperscalers, private clouds, and on-premise providers is available in the SAP Data Hub Product Availability Matrix (PAM) document. The list grows with every major release of SAP Data Hub, and this trend is expected to continue, providing options for companies to choose their infrastructure when deploying SAP Data Hub.


3.3 Product Integration

This section describes SAP’s strategy to integrate SAP Data Hub with SAP and non-SAP products to extend the capabilities of the installed products. Figure 3.3 lists a series of SAP products that can be integrated with SAP Data Hub, each of which we’ll discuss in the following sections.

Figure 3.3 Integrating the SAP EIM Portfolio with SAP Data Hub

SAP Data Services

SAP Data Services is an enterprise data management solution to integrate, transform, and improve your structured and unstructured enterprise data, primarily helping you move application data from transactional sources to data warehouses. This process includes universal data access and integration with SAP and non-SAP enterprise data sources and targets with built-in native connectors. SAP Data Services also enables you to process native-text data to extract meaningful information from unstructured data sources. Additionally, you can standardize and correct data with ease to eliminate duplicates and identify relationships among your data. Finally, the solution allows you to produce data quality dashboards to show the impact of data quality issues across your downstream systems and to simplify your data governance by transforming all types of data with a centralized business rule repository. The SAP Data Services key use cases are traditional data warehousing initiatives driven by BI, including data migration and data quality. This extract,


transform, and load (ETL) tool runs in a standalone heterogeneous landscape with a centralized, on-premise, server-based infrastructure that focuses on relational data processing and advanced data transformations and processing (e.g., joins, SQL, data quality [DQ], etc.). Given that SAP Data Hub is a pipeline-driven data integration, operations, and governance solution for disparate kinds of data (structured, unstructured, streaming, cloud, etc.), supporting both integration and processing in a distributed fashion, combining SAP Data Services and SAP Data Hub offers much more than either can provide separately. Figure 3.4 shows how these tools can bridge enterprise and big data silos, as follows:

1. Ingest large volumes of data (e.g., distance, pace, heart rate, location) from machine sensors by using an MQTT/Kafka operator (SAP Data Hub).
2. Refine data according to purpose, and store it in data stores (SAP Data Hub).
3. Acquire additional relevant structured data (e.g., customers, sales, behavioral, demographic) into data stores by remotely orchestrating SAP Data Services jobs (leveraging existing SAP Data Services investments).
4. Apply ML algorithms (e.g., classification, clustering, and identifying outliers) on the data to discover new insights about user characteristics (SAP Data Hub).
5. Invoke the process chain to ingest the results into SAP BW/4HANA for further data analysis and reporting (SAP Data Hub).

Figure 3.4 Interoperability of SAP Data Services and SAP Data Hub to Combine Enterprise Data with Big Data


The technical integration of SAP Data Services and SAP Data Hub is shown in Figure 3.5. It uses a special operator, the SAP Data Services job operator, to create a graph that connects to and invokes a remote SAP Data Services job. The SAP Data Services job reads from a supported relational database and replicates the data into HDFS. The job is created natively in SAP Data Services with SAP Data Services Designer; SAP Data Hub then allows you to reuse that job and orchestrate it from a pipeline.

Figure 3.5 Integration of SAP Data Services with SAP Data Hub
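Behind every "execute a remote job from a pipeline step" integration of this kind sits the same orchestration pattern: trigger the external job, then poll its status until it completes. The generic Python sketch below illustrates the pattern only; the client and its methods are hypothetical stand-ins, not SAP Data Services APIs.

# Generic trigger-and-poll orchestration pattern behind remote job operators.
# `client` and its methods are hypothetical stand-ins, not SAP APIs.
import time

def run_remote_job(client, job_name: str, poll_seconds: int = 10) -> str:
    execution_id = client.start_job(job_name)      # hypothetical call
    while True:
        status = client.get_status(execution_id)   # hypothetical call
        if status in ("COMPLETED", "FAILED"):
            return status                          # surface the final state
        time.sleep(poll_seconds)                   # wait before polling again

Listing 3.2 Generic Trigger-and-Poll Pattern for Orchestrating Remote Jobs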

SAP Business Warehouse and SAP S/4HANA

SAP BW and SAP S/4HANA together offer a data warehousing solution with the following properties:

- Built predominantly with structured data from enterprise systems (SAP ERP, SAP Customer Relationship Management [SAP CRM], HR, etc.)

Personal Copy for Mario Massimiliano Biffi, [email protected]

35

3 Architecture and Integration

쐍 Standardized data models and harmonized data 쐍 Decision-making support

Data lakes are often built with Hadoop or cloud object stores and have the following properties: 쐍 Massive amounts of raw and unstructured/nonrelational data 쐍 New data types, such as sensor, web, social media, devices, and so on 쐍 Active archive for historical data

Integrating your data warehouse and data lake is key for data science and data-driven IoT applications. Figure 3.6 depicts such an integration for both business intelligence and predictive analytics. Connecting a data warehouse and a data lake is often tricky because they are built with different technologies and managed by different organizations, yet both are needed by IoT applications and data science teams. SAP Data Hub offers a unique solution with its connection management and data pipelines, letting you access and process data from both sources with one tool.

Figure 3.6 Integration of a Data Lake with a Data Warehouse to Create an Intelligent Data Warehouse


SAP HANA Smart Data Integration

SAP HANA smart data integration and SAP HANA smart data quality provide an in-memory, integrated EIM services approach that pulls everything together in a single platform, eliminating the need for separate, specialized components. This one native, unified framework supports all methods of data integration from any data source, with transformation, governance, and stewardship built right into the solution. As with SAP Data Services, interoperability with SAP Data Hub means that some processing is done in SAP HANA and some in SAP Data Hub. Figure 3.7 shows how SAP Data Hub can invoke a remote SAP HANA flowgraph from within a pipeline by using the SAP HANA flowgraph operator.

Figure 3.7 Integration of SAP HANA Smart Data Integration with SAP Data Hub


SAP Agile Data Preparation

SAP Agile Data Preparation is a data-driven application that transforms data into actionable, easily consumable information by providing fast, self-service access to high-value data. It allows business users to quickly improve the value of data by discovering, preparing, and sharing it. It also optimizes IT's ability to govern how business users prepare data by monitoring and operationalizing data access and usage. Finally, SAP Agile Data Preparation accelerates business efficiency with trusted data by helping data stewards define, assess, and improve data. The application can be used to drive more successful analytics, data migration, and master data management initiatives with data preparation capabilities for everyone, without requiring any technical skills.

SAP Data Hub offers two ways to integrate with SAP Agile Data Preparation. In the first approach, shown in Figure 3.8, SAP Agile Data Preparation running on a separate SAP HANA system interoperates with SAP Data Hub running separately. Self-service, data-driven data preparation lets business users leverage the power of SAP Data Hub to do the following:

• Access and browse SAP Data Hub content.
• Retrieve a sample of the data to work on.
• Profile, assess, transform, shape, and enrich the data based on the sample.
• Trigger and track the execution on the full data set.
• Continue to work and identify the next steps.

In the second approach, SAP Data Hub supports self-service data preparation natively within SAP Data Hub. Self-service data preparation is not only the ability to transform data but a larger set of capabilities that lets you handle your data needs end to end, from the raw data to the operationalized data set you expect.


Figure 3.8 SAP Agile Data Preparation and SAP Data Hub Interoperability

Self-service data preparation includes the ability to access data and locate the right data set without necessarily knowing where the data is stored. You can also preview the data before taking any further action to make sure you have the correct data. Prior to shaping and refining the data, you need to understand the data and its potential data quality issues, if any, by accessing classical as well as advanced profiling information.

After you identify which actions you want to perform on the data, you can shape, fix, enrich, and harmonize the data by interacting with UI components and by getting immediate visual feedback on the resulting transformed data. Finally, distributing the prepared asset is as important as being able to operationalize the flow that created it. This flow is recorded in a SAP Data Hub pipeline that can also be used for operationalization, collaboration, and maintenance of the data preparation over time.


You can do the following using self-service data preparation in SAP Data Hub:

• Access the data to start preparing it.
• Use the metadata explorer or the connection browser to find the data set you're looking for.
• After locating the data set, access the menu to start preparing the data.

Note that this process automatically creates a meaningful and representative sample of the selected data set, which gives you the opportunity to interact with the sample instead of the full data set. You can then run a series of operations such as the following:

• Transform, shape, harmonize, curate, and enrich the data
  – You can easily apply actions at the column or the data set level by interacting with the data grid as well as with the side panel.
  – Some actions will influence the entire data grid and may even replace its entire content.
  – You can apply the supported transformations.
• Manage the recipe
  – All the actions you take to transform the data are recorded and listed in an object called the recipe.
  – The recipe allows you to see and edit the steps you took previously.
  – You can decide to reorder, enable, disable, or remove some actions.
• Run and manage executions
  – You can execute the defined list of refinements on the entire data set. The list of actions recorded in the recipe is applied to the full original data set to produce the output you've defined.
  – You can create the output of this execution or append the content of the processed data to an existing output.
  – Because executing the recipe on the full original data set can take some time, you can review and manage the execution requests, such as pausing or canceling the process execution.

SAP Information Steward

SAP Information Steward is a data stewardship and data integrity solution that combines data profiling, data lineage, and metadata management to give you continuous insight into the quality of your enterprise data assets. This insight helps you understand how data quality impacts your business processes so you can enhance operational, analytical, and data governance initiatives. SAP Information Steward enables your company's information governance by facilitating collaboration among multiple personas, such as business analysts, data stewards, and IT experts, to substantially improve your company's EIM initiatives and overall data integrity.

You can monitor data quality through a complete view of data quality reports and through dashboards and scorecards that narrow the profiling results down to a specific data set, helping you better understand how poor data affects your business. The data profiling in SAP Information Steward supports governance processes that define data ownership in accordance with business needs, roles, and policies, so you can find errors and isolate the records that need to be refined or adjusted. Additionally, data lineage and cleansing packages are used to ensure that your data assets are accurate and trustworthy by evaluating the merits of changes to your data structures and data models. This enables you to develop custom data-cleansing solutions for any vertical and line-of-business domain and to identify the upstream cause of data issues. Finally, SAP Information Steward enhances your operational, analytical, and data governance initiatives by automatically collecting technical and business metadata for your application repository and dictionary to retrieve and control the critical information that drives your core business.

In short, SAP Information Steward provides data stewardship to easily monitor, analyze, and improve data integrity. It gives you continuous insight into the quality of your data assets, the costs of bad data quality, and the impacts on your business, and it helps you understand how data quality affects your processes and enhances your operational, analytical, and data governance initiatives.

SAP Data Hub supports a governance solution to manage metadata for disparate kinds of data (structured, unstructured, streaming, cloud, etc.) for both integration and processing in a distributed fashion. In addition, its pipeline-driven data integration lets you integrate data and orchestrate data processing. As shown in Figure 3.9, SAP Data Hub intends to extend and complement SAP Information Steward by sharing data for new capabilities with additional sources in big data, cloud storage, and streaming data.

Figure 3.9 SAP Information Steward and SAP Data Hub Interoperability

SAP Landscape Transformation Replication Server

SAP Landscape Transformation Replication Server (SAP LT Replication Server) enriches the SAP HANA platform by moving data in real time between different systems within the same network, across wide area networks, and in the cloud, so you can access the right information in the right place at the right time. SAP LT Replication Server replicates data in large and distributed landscapes by capturing change data with near-zero impact, which reduces data transfer to target systems, and by embedding the replication server as middleware that can be deployed without operational disruption. It also supports data transformation while replicating by converting, enriching, and reducing the target records with flexible filtering options.


Additionally, SAP LT Replication Server ensures transactional integrity by supporting point-in-time recovery and enabling extensive logging capabilities. Finally, the solution translates complex SAP application structures immediately into transparent table structures, which reduces the time and effort of manual configuration.

SAP LT Replication Server is positioned for real-time (trigger-based) data replication from ABAP and non-ABAP sources (SAP NetWeaver-supported databases only) into SAP Data Hub. From there, the data can be consumed by Hadoop or by an operator/pipeline. The minimum required release for SAP LT Replication Server is Data Migration Server (DMIS) 2018 (equivalent to DMIS 2011 SP 15), and for SAP Data Hub, it's 2.3. For details, check SAP Note 2647941.

To overcome the challenge of data being kept in silos in big data landscapes, a real-time connection is needed between big data and enterprise data, as shown in Figure 3.10. SAP LT Replication Server participates in an end-to-end big data scenario by writing enterprise data out of different SAP sources into SAP Vora as orchestrated by SAP Data Hub. This allows you to access and work with your data across system boundaries, benefiting from centralized data orchestration and data governance through SAP Data Hub.

Figure 3.10 SAP LT Replication Server Replicates SAP Application Data to SAP Data Hub in Real Time


SAP Data Quality Management, Microservices for Location Data

SAP Data Quality Management, microservices for location data, provides data cleansing services in SAP Cloud Platform, including address cleansing, geocoding, and reverse geocoding REST services. Address cleansing is available for more than 240 countries, and geo capabilities are available in many countries as well. The service facilitates easy integration of partner/customer solutions with SAP applications and is available in a pay-as-you-go model or sold as a standalone service in SAP Cloud Platform. Figure 3.11 shows that SAP Data Hub offers an operator to call SAP Data Quality Management, microservices for location data, within a SAP Data Hub pipeline.

Figure 3.11 SAP Data Quality Management, Microservices for Location Data, Integration with SAP Data Hub

As of SAP Data Hub 2.3 and onward, you'll find the following features (a hedged sketch of calling the service over REST follows this list):

• New SAP Data Quality Management, microservices operators
• Address cleanse/geocode, reverse geocode, and client operators
• Prebuilt sample graphs
• Support for productive/nonproductive use
• Support for operator-side configuration overrides
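Under the hood, these microservices are REST services, so an operator or external client ultimately issues an HTTP call. The following sketch illustrates such a call with Python's requests library; the endpoint URL, payload fields, and credentials are placeholders, not the documented contract of SAP Data Quality Management, microservices for location data:

import requests

# Placeholder endpoint and field names; consult the official API reference
# for the real request/response contract.
DQM_URL = "https://dqm.example.hana.ondemand.com/dq/addressCleanse"

payload = {
    "addressInput": {"mixed": "3999 West Chester Pike, Newtown Square PA"},
    "outputFields": ["std_addr_locality", "latitude", "longitude"],
}

resp = requests.post(DQM_URL, json=payload, auth=("user", "password"), timeout=30)
resp.raise_for_status()
print(resp.json())  # cleansed address fields plus geo coordinates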

4 Data Management Capabilities

The main entry point to SAP Data Hub is the SAP Data Hub launchpad, which allows you to launch all the SAP Data Hub applications as shown in Figure 4.1.

Figure 4.1 SAP Data Hub Launchpad


This browser-based launchpad enables you to connect all your sources in your data landscape, extract the metadata from your connected sources, build a catalog of metadata, explore and search metadata in your catalog, manage your policies, monitor all the activities, build and schedule your pipelines, perform content lifecycle management, and perform many other tasks. Furthermore, this launchpad can be personalized based on your roles so it suits your needs and your work by assigning applications as favorites or displaying applications in groups. This section describes the major functionalities available in SAP Data Hub through the SAP Data Hub launchpad, which we’ve broken out into the following three categories: data pipeline, data orchestration and workflows, and governance. We’ll then walk through a more practical example of how SAP Data Hub can be used for a specific process, making use of many of the capabilities already discussed.

4.1 Pipeline

The SAP Data Hub launchpad, as shown previously in Figure 4.1, gives you the ability to launch the SAP Data Hub modeler to create your data processing pipelines. The SAP Data Hub modeler can help you design and execute your data-driven use cases.

Concepts and Terminology

The operator is the fundamental concept of a data pipeline. This single computation unit with input and output ports executes a data operation within the landscape, such as a connection to read from a source, a connection to write to a target, or an operation to process the data. Data flows are orchestrated through operators as well.

A graph of operators, that is, a set of connected operators, is called a pipeline: a network of operators connected to each other through typed input and output ports for data transfer and processing, where you can define and configure parameters at the level of every operator. Pipelines can be executed immediately or scheduled for recurrent execution. Figure 4.2 shows a representation of operators and a pipeline.

Figure 4.2 Operators and a Pipeline in SAP Data Hub

Building Data-Driven Applications with Operators

Every operator port is typed: a port can carry any of many data types, such as string, blob, int64, float64, byte, message stream, and so on. A port can serve as an input or output to an operator, and operators are connected port to port based on the port type. Events, such as messages, are delivered to the input ports so the operator can react to its environment, and the operator talks back to the environment through its output ports. Color codes identify compatible port types; an output port and an input port are compatible when the base type names of both ports match. Figure 4.3 shows an S3 operator that consumes incoming string events and can send output in two different formats: one as a string and another as a message stream.

SAP Data Hub includes the SAP Data Hub modeler, a tool based on the SAP Data Hub pipeline engine that uses a flow-based programming paradigm to create data processing pipelines (graphs). Big data applications require advanced data ingestion and transformation capabilities.


Figure 4.3 An Example Operator with Input and Output Ports
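To make the port concept concrete, here is a minimal script for a Python3 operator, assuming the callback-style api object that the modeler injects into script operators (a sketch based on SAP's published examples, so details may vary by release):

# `api` is provided by the Python operator runtime, so this script is not
# standalone Python. It reads strings from the "input" port and forwards a
# numbered copy through the "output" port.
counter = 0

def on_input(data):
    global counter
    counter += 1
    api.send("output", "%d: %s" % (counter, data))  # emit on the output port

# Register the callback so the engine calls on_input for every event
# delivered to the input port.
api.set_port_callback("input", on_input)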

When launching the modeler, as shown in Figure 4.4, you can easily design your data processing pipelines using existing graphs or the available operators. The left-hand side of the application provides access to all the objects you can interact with to create your pipelines, while the main screen displays the pipelines and allows you to execute them and review the results.

Figure 4.4 SAP Data Hub Modeler Main UI


With a newly created graph, you can address common use cases using different operators. For example, you can ingest data from source systems as shown in Figure 4.5 and Figure 4.6.

Figure 4.5 SAP Data Hub Modeler Connectivity Operators

By selecting and then dragging and dropping the available operators, you can ingest data from SAP HANA, from message queues (e.g., Apache Kafka), or from data storage systems (e.g., HDFS or S3).


In addition, other operators, such as the Connectivity (via Flowagent) operators shown in Figure 4.6, allow the direct consumption of data from table structures or through query languages.

Figure 4.6 SAP Data Hub Modeler Flowagent Connectivity Operators and Data Quality Operators

You can easily improve your data with out-of-the-box data quality operators that perform address cleansing, geocoding and reverse geocoding transformations, data masking, and anonymization, as shown in Figure 4.6.


Building pipelines can help you transform the data to a desired target schema and then store the data in target systems for consumption, archiving, or analysis. These analyses could run ML or other advanced algorithms on the processed data. You can also integrate external scripting code, such as JavaScript, Python, or R, into your processing pipeline, as shown in Figure 4.7.

Figure 4.7 SAP Data Hub Modeler Processing Operators

In addition to the operators we briefly introduced previously, the modeling tool offers hundreds of predefined operators.


Figure 4.8 lists only a subset of the operators included out of the box in SAP Data Hub. A set of operators is included to directly connect to various structured, unstructured, stream, cloud, and big data sources for both read and write purposes (e.g., Azure Data Lake [ADL], Local File System [File], Google Cloud Storage [GCS], HDFS, Amazon S3, Azure Storage Blob [WASB], and WebHDFS). Connectivity via flowagent provides enhanced connectivity for a set of sources in addition to data access, replication, and metadata and data lineage. The modeler also includes operators to execute Hadoop/Spark jobs (e.g., Spark, Spark SQL, PySpark, Hive), connectivity to SAP Data Quality Management, microservices for location data, on SAP Cloud Platform, and a set of SAP Leonardo Machine Learning Foundation operators.

Figure 4.8 Subset of Operators Provided in SAP Data Hub

Along with predefined operators that run a process within a pipeline and provide a continuous stream, SAP Data Hub can also run a shell command for each message that arrives within a pipeline. It allows you to write and run custom scripts for data manipulation within a pipeline and supports building reusable operators in different programming languages (e.g., JavaScript, Go, Python, or R) using the SAP Data Hub software development kit (SDK).


All of these standard operators, customer/partner operators, and wrapped customer code yield extensible, scalable, containerized, distributed, production-ready graphs that can be transported across the landscape, scheduled for execution, and monitored with SAP Data Hub. For example, Figure 4.9 shows a graph that reads product review data from an HDFS data source (1). It then processes the data by parsing it and computing sentiment analysis information (2), and loads the results into a SAP Vora table (3) to be consumed by another tool or process.

Figure 4.9 Flow-Based Programming with SAP Data Hub

Pipeline Execution at Runtime

As shown in Figure 4.10, the lifecycle of a data pipeline includes three steps: visual design, model repository, and image composition. The first step, visual design, can be performed using the modeling tool, which provides an intuitive way to design complex data streams and transformations and lets you execute the pipeline while you are still developing it.


Figure 4.10 Runtime of a Data Pipeline

You can also monitor the execution to validate the models before using them in production, as shown in Figure 4.11. You can execute your pipeline from the header and see the execution statuses, both while running and once finished, in the footer. The model repository, the second step in the data pipeline lifecycle, allows you to reuse graphs and operators to avoid duplicated effort and to attain maximum collaboration between multiple types of personas. Additionally, the repository supports tag-based runtime specification as well as descriptions for the containers that are meant to be executed.


Figure 4.11 Runtime Panel of SAP Data Hub Modeler

You can easily save your created graphs as repository objects so they can be reused, as shown in Figure 4.12. Finally, the image composer is an internal process that chooses between containers based on operator tags; it builds new images on demand and deploys them on a Kubernetes cluster without requiring you to interact with the infrastructure directly.


Figure 4.12 SAP Data Hub Modeler Model Repository

4.2 Data Orchestration and Workflows

SAP Data Hub orchestrates external processes so that their execution can be linked together into a more complex data processing pipeline. This allows you to orchestrate and execute multiple tasks in a specific order.

4.2.1 Concepts and Terminology

Figure 4.13 introduces a special operator called the workflow operator, which communicates with external systems to orchestrate and schedule external processes. For example, a workflow operator allows you to execute a remote SAP Data Services job with a data services operator directly from SAP Data Hub.


Figure 4.13 Workflow Operator and Orchestration Concept

Workflow pipelines let you orchestrate multiple tasks and execute them in a given order. Workflow operators run for a limited period and finish with either a Success or a Failure status; on Success, the next connected operator starts executing. A toy sketch of these semantics follows.
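The following sketch (plain Python, not SAP Data Hub code) mimics these semantics: each task runs once, reports Success or Failure, and only Success triggers the tasks wired after it:

def run_workflow(tasks, edges, start):
    """tasks maps a name to a callable returning True (Success) or False
    (Failure); edges maps a name to the follow-up tasks it triggers."""
    pending = [start]
    while pending:
        name = pending.pop(0)
        ok = tasks[name]()
        print(name, "Success" if ok else "Failure")
        if ok:  # only Success triggers the connected operators
            pending.extend(edges.get(name, []))

run_workflow(
    tasks={"bw_chain": lambda: True, "hana_flowgraph": lambda: True},
    edges={"bw_chain": ["hana_flowgraph"]},
    start="bw_chain",
)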

Data Orchestration

An example of an orchestration is shown in Figure 4.14. Once triggered, master data is first updated in SAP BW via a process chain (1), which then triggers a request to run an SAP HANA flowgraph (2). Orchestration in general can be performed with external processes triggered from within SAP Data Hub, which is called external orchestration. The following are examples of external orchestration operators:

• SAP BW process chain workflow operator
  Triggers execution of a process chain on a SAP BW system.
• Data transfer (SAP BW) workflow operator
  Transfers data from a SAP BW system into SAP Vora tables created on the fly.
• Data services workflow operator
  Executes remote data services jobs.
• SAP HANA flowgraph workflow operator
  Triggers execution of an SAP HANA flowgraph using the SAP HANA smart data integration RESTful application programming interface (API).
• Spark/Hadoop operator
  Allows the submission of Spark jobs, Hive queries, and so on to Hadoop clusters.

Figure 4.14 Data Orchestration with Two Workflow Operators

In addition, there is a set of internal workflow operators that start a pipeline on a local or remote SAP Data Hub pipeline engine. The data transform workflow operator runs relational transformations (join, union, filter, etc.) on structured data (tables, CSV, Parquet, etc.). Figure 4.15 shows the "orchestrator," which is a pipeline consisting of multiple data workflow operators.

Figure 4.15 Data Workflows in Design Time and Runtime

The heavy lifting/logic execution happens either internally (in the operator container) or externally (in the connected system); in this example, the connected systems are SAP HANA and SAP BW. The design time is translated into runtime containers. Connected systems are external systems, such as SAP HANA, SAP BW, and SAP Data Services, that are connected with SAP Data Hub for data access within the data pipeline, metadata, and monitoring.

Data Transformation and Data Quality Operators

The graphical editor enables you to build data transformation processes without coding and allows data transformation on structured data stored in Hadoop or cloud storage. Data transformation and data quality operators are supported on multiple and combined data sources and data targets, as shown in Figure 4.16 and Figure 4.17.

Figure 4.16 Data Transformation Pipeline


You can easily choose Projection, Aggregation, Join, Union, or Case without coding, as shown in Figure 4.17.

Figure 4.17 Data Transformation Operators
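For readers who think in code, the following pandas sketch shows rough equivalents of these codeless transforms (the tables and column names are made up; this is an analogy, not the Data Transform operator's implementation):

import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "cust_id": [10, 20, 10],
                       "amount": [99.0, 15.5, 42.0]})
customers = pd.DataFrame({"cust_id": [10, 20], "country": ["DE", "FR"]})

joined = orders.merge(customers, on="cust_id")          # Join
projected = joined[["order_id", "country", "amount"]]   # Projection
filtered = projected[projected["amount"] > 20]          # Case-like filter condition
by_country = filtered.groupby("country", as_index=False)["amount"].sum()  # Aggregation
unioned = pd.concat([filtered, filtered])               # Union of two inputs
print(by_country)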

Additionally, SAP Data Hub provides data quality operators to cleanse, anonymize, enrich, and validate data assets. You can mix and match these operators in the data pipeline to improve the quality of the data that needs to be processed.

Data cleansing and enrichment services are available in SAP Data Hub through integration with SAP Data Quality Management, microservices for location data, on SAP Cloud Platform. Three operators support this service: the address cleanse operator prepares user data for address cleansing and/or geocoding requests, the reverse geo operator prepares user data for reverse geocoding requests, and the client operator sends those requests to SAP Data Quality Management, microservices for location data.

The data masking functionality, which can be used to safeguard your critical business assets, allows you to protect sensitive personally identifiable data as part of industry regulatory compliance. You can also randomize sensitive production data for your test data management initiatives.


The anonymization operator is used for information privacy protection. It removes personally identifiable information from data sets to keep the people whom the data describes anonymous. The operator distinguishes several parameter types: sensitive, nonsensitive, quasi-identifier, and identifier information (a toy sketch of quasi-identifier generalization follows this list):

• The sensitive parameter is used to preserve privacy in data sets by reducing the granularity of the data representation.
• The nonsensitive parameter is used to declare that a field doesn't contain sensitive information.
• The quasi-identifier is used to define equivalency classes.
• The identifier provides unambiguous reidentification of the individual to which the record refers.
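To make the quasi-identifier idea concrete, here is a toy pandas sketch of generalization and suppression; it illustrates the k-anonymity intuition behind equivalency classes, not the anonymization operator's actual algorithm (the column roles and data are made up):

import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],        # identifier -> dropped
    "zip": ["69190", "69193", "69190", "69118"],  # quasi-identifier
    "age": [34, 36, 35, 52],                      # quasi-identifier
    "heart_rate": [62, 71, 66, 80],               # sensitive -> kept
})

k = 2
anon = df.drop(columns=["name"])                  # remove direct identifiers
anon["zip"] = anon["zip"].str[:3] + "**"          # generalize ZIP codes
anon["age"] = (anon["age"] // 10 * 10).astype(str) + "s"  # bucket ages by decade

# Suppress rows whose (zip, age) equivalency class has fewer than k members.
sizes = anon.groupby(["zip", "age"])["heart_rate"].transform("size")
print(anon[sizes >= k])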

SAP Data Hub provides a large list of operators dedicated to data scientists for running ML algorithms within SAP Data Hub, and the list grows with every release. The out-of-the-box operators are categorized as follows:

• Connectivity
  Connectivity to big data and cloud storage, including Local File System (File), Azure Data Lake (ADL), Google Cloud Storage (GCS), Hadoop File System (HDFS), Amazon S3, Microsoft Azure Blob Storage (WASB), and WebHDFS sources. This category also supports connectivity to relational databases, including SAP HANA and third-party databases, and to various messaging systems, including Kafka, NATS, WAMP, and MQTT.
• Converter
  Conversion of data types.
• Computer Vision
  A set of Python OpenCV operators for processing images and video, available at the beta stage.
• Data Quality
  Data masking, anonymization, validation rules, and integration with SAP Data Quality Management microservices.
• Machine Learning Examples
  Open source ML algorithms bundled as examples.
• Machine Learning Utilities
  Chunk data utilities for creating dashboards.
• Machine Learning Predictive Analytics
  Predictive analytics, bundled as beta.
• Hadoop/Spark
  Spark-submit and Livy Spark-submit.
• SAP Integration
  Out-of-the-box integration with SAP Cloud Platform Integration and SAP Process Integration.
• SAP Vora
  Processing with SAP Vora, including loading to SAP Vora and text analysis.
• SAP Leonardo Machine Learning Foundation
  Various ML algorithms offered by SAP Leonardo Machine Learning Foundation.
• Language Support Operators
  Support for Golang, R, Python2, Python3, JavaScript, and so on.

The data science and ML operators can be categorized in three groups. The first group consists of processing operators that allow you to use existing code or develop new code in several languages, such as R, Python, Go, or JavaScript. The second group offers operators that support your data science projects by letting you produce or consume models and feed dashboards. The third and main group offers dedicated ML operators that process and learn from your data to create either a model or an output that can be directly leveraged or integrated as part of your processes; it includes computer vision capabilities, TensorFlow and Spark ML algorithms, as well as SAP Predictive Analytics and SAP Leonardo Machine Learning Foundation operators.

Custom Operators

Even though SAP Data Hub provides hundreds of standard operators to connect to various sources, such as structured, stream, and cloud stores, and to run both relational operations and advanced ML algorithms, you may still need to build custom operators to meet your business requirements. The motivation to build a custom operator may come from a new idea for which no suitable predefined operator is available, a requirement to enhance an existing operator, or a need to reuse existing code and algorithms as an operator within a SAP Data Hub pipeline. The concept is shown visually in Figure 4.18.

Figure 4.18 Concept of a Custom Operator Derived from the Base Operator


The SAP Data Hub pipeline modeler provides a couple of predefined base operators. Custom operators derived from the base operators can be extended with custom parameters, input and output ports, documentation, scripts, libraries, and Docker environments. Docker files describing the container runtime for operators are chosen based on tags, and the modeler stores all artifacts in the repository. As shown in Figure 4.18, a custom operator can be written in any language, but scripts in R, Python, Go, and JavaScript can inherit from the available base operators. Building and using custom operators is quite straightforward and involves the following steps:

1. Create a folder in the modeler.
2. Create a new operator.
3. Customize the new operator.
4. Include the new operator in a pipeline.

If you're interested in building custom operators, SAP has released several examples publicly to GitHub at https://github.com/SAP/datahub-integration-examples.

4.3 Governance

Finding data is often difficult because it's managed by different business units and may reside in various data lakes or applications, on premise or in the cloud. Data can also arrive as streams and in various formats. This variation requires a catalog of your metadata that business users can easily query in natural language, without needing to know technical details about the source or how the data is stored. After a simple search to identify and locate the right data, the next challenge is to gain more insight about it. The structure (the number and types of columns), the patterns in the data, how the data is used by others, and how those users rate its quality are all important to know before you use the data.


By crawling through data sources to gather valuable metadata and storing that metadata in a centralized catalog, SAP Data Hub helps you easily understand and secure your data, providing a single entry point for a holistic view of your entire enterprise data landscape. Furthermore, SAP Data Hub supports profiling diverse data sources so you can gain a deeper understanding of the data, create meaningful data pipelines, easily find data quality issues, and identify data discrepancies across landscapes. You can also use centralized data access to control all orchestration, data refinement, scheduling, and monitoring in order to increase the governance of all your data. You need to make sure you can share data with users in a way that makes it easy to find and consume, which means monitoring and enforcing policies to keep the data clean.

SAP Data Hub provides answers to these requirements, either with the features delivered today or with the published road map for the remaining functionality. Figure 4.19 outlines how SAP Data Hub intends to address users' data governance requirements. The core of the solution is the metadata catalog, which is built automatically by connecting all the necessary sources, including cloud storage, streaming, enterprise applications running on premise or in the cloud, and various structured and unstructured data sources. If you have an existing metadata solution, SAP Data Hub is likely to reuse the metadata already extracted into it. Creation of the metadata catalog will be supported by workflows. The metadata catalog provides the following capabilities:

• Search and filter
• Data discovery and profiling
• Lineage analysis
• Impact analysis
• Modeling
• Annotation and suggestions
• Rules, policies, and key performance indicators (KPIs)

Figure 4.19 Data Governance and Data Analysis with SAP Data Hub

Easy-to-use search and filter functions are key for business users to locate data that is physically scattered across geographical and business boundaries. Users don't need to understand the underlying technical structure to find the data needed for their daily job, whether they are business analysts forecasting sales or data scientists seeking more insight about the data. SAP Data Hub offers a simple search bar to search data sets.

Data discovery and profiling of the search results are ways to get more insight from the data. Profiling not only captures the structure of the data but also indicates how good the data is through statistical information such as min, max, distribution, and null counts; a minimal illustration of such statistics closes this section. Additional useful information is expected from SAP Data Hub about how these data sets are consumed by various applications, including reports, and how other users are using and commenting on the data sets.

Metadata lineage analysis is critical to understanding the original data and its evolution. Impact analysis provides information about any changes to the existing data and how those changes will affect all the applications that use these data sets. Both metadata lineage and impact analysis are critical capabilities in the SAP Data Hub metadata solution. Data annotation by labeling is useful for the end user to understand the value of the data. The capabilities to define policies on the data, including security policies, and to define and enforce rules, policies, and KPIs are critical to keeping data clean for all consumers.
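To make the profiling statistics mentioned above concrete, the following minimal pandas sketch computes per-column null counts, distinct counts, min/max, and a value distribution for a made-up sample; it only illustrates the kind of output profiling produces, not SAP Data Hub's implementation:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, None],
    "country": ["DE", "DE", "FR", "US"],
})

profile = pd.DataFrame({
    "nulls": df.isna().sum(),   # missing values per column
    "distinct": df.nunique(),   # cardinality per column
    "min": df.min(),
    "max": df.max(),
})
print(profile)
print(df["country"].value_counts(normalize=True))  # value distribution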

4.4 End-to-End Process

This section presents a simple end-to-end scenario using SAP Data Hub to build a basic workflow that combines sales and customer data with product review data to perform sentiment analysis. For conciseness, this overview omits some parameters and steps; the goal is to give a general idea of one use of the solution, drawing on concepts introduced earlier, such as the pipelining and metadata management capabilities.

Create Connections After you’ve logged on to the SAP Data Hub application and are in the launchpad, the first task is to create or check your connection using the Connection Management tool, as shown in Figure 4.20. This automatically launches a new tab in the browser with the Connection Management tool that lists all existing connections, as shown in Figure 4.21.


Figure 4.20 SAP Data Hub Launchpad: Connection Management Tool

Figure 4.21 SAP Data Hub Connection Management


You can create a new connection here, or select any existing or newly created connection and test whether it's working, as shown in Figure 4.22 and Figure 4.23.

Figure 4.22 Checking the Status of an Existing Connection

Figure 4.23 Status Information on an Existing Connection


When creating your connection, you'll be able to connect to many different systems. You'll also need to define whether you want this connection to be available for managing metadata so you can browse, discover, profile, and understand the data sets available in your connected systems. After your connection is created and you've checked its status, you can return to the SAP Data Hub launchpad to access the Metadata Explorer and start discovering your data assets.

Govern Your Connected Systems

In the launchpad, you can browse and profile your data using the Metadata Explorer tool. This automatically launches a new tab in the browser with the Metadata Explorer tool, which shows the metadata discovery capabilities, as shown in Figure 4.24.

Figure 4.24 SAP Data Hub Metadata Explorer


The SAP Data Hub Metadata Explorer screen helps you govern your data sets and provides a simple, quick overview of what is happening with your connected systems. For example, you can see the number of indexed data sets per connected system and access them directly. You can also get a high-level overview of the pipeline execution monitoring results. It's also a main entry point to browse the content of your connected systems, as shown in Figure 4.25, or to access other Commonly Used Actions.

Figure 4.25 Commonly Used Actions in SAP Data Hub Metadata Explorer


When browsing your connections, you'll see a list of all the data sets you can access, as well as whether these data sets are Indexed, Published, or Profiled, as shown in Figure 4.26. You can select any of these data sets to perform actions on them.

Figure 4.26 List of Data Sets on a Connected System


As shown in Figure 4.27, for example, you can perform a New Publication Action. You can also choose View Fact Sheet to see the information associated with the selected data set or choose View in Catalog to see the data set in the metadata catalog. Finally, you can start to profile the data set (Start Profiling) or to prepare the data (Prepare Data).

Figure 4.27 List of Actions on Data Sets

If you access the fact sheet of a data set, as shown in Figure 4.28, you can first view the associated metadata information to get a quick insight into the number of columns and their associated data types.


Figure 4.28 Metadata Information Associated with a Data Set

At any point, you can choose Data Preview to see a preview of the data, as shown in Figure 4.29, to explore the data further and build the understanding you need for your scenario.


Figure 4.29 Preview of the Data

After you’ve used the Metadata Explorer to locate the data you need to achieve your goal, one of the next steps is to create pipelines using the pipeline modeler to enhance or enrich your data or run advanced algorithms, such as ML, with your data sets.


Create a Pipeline

In the launchpad, you can build pipelines using the Modeler tool. This automatically launches a new tab with the SAP Data Hub Modeler screen, which allows you to create pipelines and execute them (see Figure 4.30).

Figure 4.30 SAP Data Hub Modeler


You can create a new graph and drag and drop existing operators to process your data according to your project. Every node in the graph can be configured, and the parameters depend on the chosen operator. In our example, as shown in Figure 4.31, we created a graph that ingests data from a flat file at a WebHDFS location, and we just streamed the output to a wiretap operator to see it.

Figure 4.31 Creating a Graph and Selecting an Operator


When you select an operator, you can request to see all the configuration parameters, which appear in the right-hand side panel. At any point, you can execute a graph, as shown in Figure 4.32. When you click the Play button at the top of the application, a notification of the execution status appears at the bottom.

Figure 4.32 Executed Pipeline


When running a pipeline, if you have operators that can open additional UIs, you can select them and choose to launch the UI. Depending on your pipeline, some UIs never terminate and keep running as long as you let them; others may terminate based on your design.

In our example, we enriched the graph with additional operators that process the data and extract sentiment information using a Python operator in which we developed the coding logic; a toy sketch of such sentiment logic follows. However, there are many other ways to meet the exact same requirement, such as using predefined operators or other open operators like the one we used. The graph also outputs the results into a SAP Vora table so they can be consumed by other processes or third-party applications.
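The sentiment step itself is just script logic inside a Python operator. The following is a toy stand-in for that logic; the keyword lexicon is purely illustrative, and api is the object the modeler injects into Python operators (this is not the exact code used in the example graph):

POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "poor", "broken", "terrible"}

def on_review(review):
    # Score a review by counting lexicon hits, then emit
    # "<score>\t<review>" for a downstream SAP Vora loader.
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    api.send("output", "%d\t%s" % (score, review))

api.set_port_callback("input", on_review)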

When running jobs in SAP Data Hub, whether profiling data sets in the Metadata Explorer or executing pipelines in the SAP Data Hub modeler, you can monitor the overall activity.

Monitor the Activity

In the launchpad, you can use the Monitoring tool to monitor all the activities in the system, including pipeline jobs, profiling jobs, and metadata extraction jobs. This automatically launches a new tab in the browser with the Monitoring tool, which enables you to monitor the activity with overall analytics and more detailed information about job instances and schedules, as shown in Figure 4.33. You can also drill down further into any job from this screen.


Figure 4.33 SAP Data Hub Monitoring

5 Outlook and Road Map

Since SAP Data Hub launched in September 2017, it has evolved rapidly to address data management requirements in complex and distributed landscapes for data of all shapes and formats. This section summarizes the current outlook for SAP Data Hub; however, be aware that information shared is subject to change by SAP and merely reflects the current best knowledge of the road map.


This section outlines the outlook for SAP Data Hub deployment in additional public and private clouds, as well as the extensive product capabilities planned to address integration with SAP and non-SAP applications, self-service data preparation natively within SAP Data Hub, comprehensive metadata management, and support for data science and machine learning platforms.

5.1 Deployment

In addition to the supported public clouds (hyperscalers) and certified partners' public clouds listed in the SAP Data Hub product availability matrix (PAM), new public and private clouds are being added and will continue to be added in the future. SAP intends to provide SAP Data Hub as a fully managed, pay-per-use cloud offering in SAP Cloud Platform. The planned features include the following:

• Integration into SAP Cloud Platform, using SAP's Kubernetes offering of project "Gardener"
• Fully automated deployment procedure
• Inclusion of the compute layer, storage layer, and administration of the Kubernetes cluster
• Complete monitoring and management by SAP Data Hub DevOps teams
• Multistep, pay-per-use commercialization
• Base for platform-as-a-service (PaaS) offerings
• Integration with SAP HANA as a service
• Integration with further SAP Cloud Platform services (e.g., Kafka, IoT services)


5.2 Capabilities

In the following sections, we’ll look at some functionalities that are currently planned to be added to SAP Data Hub.

Unified Data Integration and Processing

SAP Data Hub is planned to support integration with all SAP applications (SAP ERP, SAP S/4HANA, SAP S/4HANA Cloud, SAP BW and SAP BW/4HANA, SAP C/4HANA, SAP Concur, SAP Fieldglass, SAP Ariba, CallidusCloud, Qualtrics, etc.), non-SAP applications, and all external and distributed sources, as outlined in Figure 5.1.

Figure 5.1 Comprehensive Platform to Provide Integration to SAP Applications and External and Distributed Data Sources

The strategy to connect all SAP applications, particularly SAP cloud applications, is shown in Figure 5.2.


Figure 5.2 Integration with SAP Cloud Applications

Integration with SAP cloud applications includes the definition of a uniform SAP Cloud Data Integration API for scalable, consistent, real-time, and seamless data integration into the Intelligent Suite. The main use cases for this uniform SAP Cloud Data Integration API are as follows:

• Holistic data management with SAP Data Hub and SAP HANA
• Building data warehouse analytics with SAP BW/4HANA
• Advanced analytics and planning with SAP Analytics Cloud

SAP Data Hub enables seamless integration of data and metadata across all SAP solutions in the cloud using one API based on open standards. Close and scalable data integration to the Intelligent Suite is a cornerstone of this strategy. An example of this API is shown in Figure 5.3.

Figure 5.3 Example of an Integration with SAP Cloud Applications via One API


Figure 5.2 showed close and scalable data integration from the Intelligent Suite as a foundation for analytics, data warehousing, and data management. This data integration needs to support a wide list of sources, targets, modes of data integration, and protocols, as follows:

• Seamless integration of data and metadata for all SAP cloud solutions (SAP Fieldglass, SAP S/4HANA, SAP Ariba, SAP Concur, etc.)
• Scalable, consistent, and real-time data integration across solutions
• OData v4 communication protocol

The vision is to support both full and delta requests when the source supports delta changes, integrated with metadata management, data transformation, and the SAP Data Hub pipeline modeler. A hedged sketch of an OData v4 delta request follows.
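As a rough illustration of what a full-plus-delta cycle looks like at the protocol level, consider the following sketch; the service URL and entity set are placeholders, and the delta-tracking flow (the Prefer: odata.track-changes request header and the @odata.deltaLink in the response) follows the OData v4 specification rather than any documented SAP Data Hub API:

import requests

BASE = "https://app.example.com/odata/v4/SalesOrders"  # placeholder service

# Initial full request, asking the service to track changes.
resp = requests.get(BASE, headers={"Prefer": "odata.track-changes"}, timeout=30)
resp.raise_for_status()
page = resp.json()
rows = page["value"]                       # full snapshot of the entity set
delta_link = page.get("@odata.deltaLink")  # handle for future delta requests

# Later: fetch only what changed since the snapshot was taken.
if delta_link:
    changes = requests.get(delta_link, timeout=30).json()["value"]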

Optimized Integration with ABAP Applications

SAP Data Hub's vision is to support a comprehensive way to integrate with all SAP ABAP applications. The scope is as follows:

• Supports RFC and HTTP(S)
• Provides connectivity to SAP S/4HANA (on premise and cloud) and SAP NetWeaver
• Offers a rich semantics layer, including core data services (CDS)-based extraction (with delta changes) in real time
• Uses the new ABAP pipeline engine to extend modeling capabilities and include ABAP-specific aspects
• Provides a workbench to create customer-specific ABAP operators
• Enables browsing, indexing, publishing, profiling, and data lineage of any ABAP-based system via metadata integration


SAP Cloud Platform Open Connector SAP Data Hub intends to simplify connectivity to third-party apps by providing feature rich, prebuilt connectors to more than 150 non-SAP cloud or on-premise applications. This comprehensive and seamless integration to third-party apps will work out of the box. The motivations for this SAP Cloud Platform open connector are as follows: 쐍 Provides out-of-the-box connectivity to third-party apps via harmonized

REST APIs 쐍 Easily builds API compositions across connectors 쐍 Develops and maps canonical data models to extend prebuilt connectors

Integration with third-party applications via SAP Cloud Platform open connectors will be accelerated. Specific benefits include the following: 쐍 Inherits more than 150 standard connectivities to non-SAP cloud and on-

premise applications 쐍 Harmonized APIs to reduce cost of third-party integration 쐍 Normalized authentication, error handling, search, pagination, and bulk

support 쐍 Standardized events that supports polling and webhooks 쐍 Support for metadata extraction, data preview, and read (GET and *BULK) 쐍 Extends write and change data capture (CDC) in future releases

SAP Cloud Platform Open Connectors were introduced in the May 2019 release; however, some of these capabilities are still scheduled for upcoming releases.
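To show what a harmonized REST call could look like, here is a hedged sketch in the style of the Open Connectors API. The host, the secrets, and the crm hub path are placeholders; the composite Authorization header follows the pattern Open Connectors publishes, but treat every detail as an assumption rather than a definitive reference.

```python
"""Hedged sketch of a harmonized REST call: one URL shape and one auth
header regardless of which CRM sits behind the connector. Host, hub
name, and all secrets below are placeholders."""
import requests

BASE = "https://api.openconnectors.example.com/elements/api-v2"  # placeholder host
headers = {
    # Open Connectors-style composite token: organization and user secrets
    # plus the token of the connector ("element") instance; all placeholders.
    "Authorization": "User <user-secret>, Organization <org-secret>, Element <element-token>",
}

# The same harmonized resource path works for any CRM connector instance,
# with normalized pagination via query parameters.
resp = requests.get(f"{BASE}/hubs/crm/contacts", headers=headers,
                    params={"pageSize": 50})
resp.raise_for_status()
for contact in resp.json():
    print(contact.get("firstName"), contact.get("lastName"))
```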

Embedded Self-Service Data Preparation
SAP Data Hub will enable nontechnical users, such as business users or citizen data scientists, to access, assess, harmonize, shape, and enrich data without modeling a pipeline or writing a single line of code. The extended capabilities of SAP Data Hub are depicted in Figure 5.4.


Figure 5.4 Embedded Self-Service Data Preparation in SAP Data Hub
(The original figure depicts SAP Data Hub, including self-services, metadata management, pipelining, refinement, and orchestration, connected to sources and systems (SAP, non-SAP, on premise or cloud) and to enterprise apps and BI tools, SAP HANA (on premise, cloud, and multicloud), ML and predictive analytics, SAP solutions for enterprise information management (EIM), Big Data services from SAP, third-party data lakes, and cloud object stores.)

A few capabilities of self-service data preparation in SAP Data Hub include the following:
• Integrate with the Metadata Explorer to access catalog search results directly.
• Work on smart samples of data sets, which is helpful when a source is very large.
• Transform data with a simple click.
• View results instantly during design.
• Apply the transformations to the full data set at any time.

Users may decide to execute the defined list of refinements against the entire data set. The actions recorded in the recipe are then applied to the full original data set to produce the output the user defined. Users will be able to create a new output from this execution or append the processed data to an existing output. Because executing a recipe on the full original data set can take some time, users will be able to review and manage execution requests, for example by pausing or canceling a running process.
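The recipe concept can be illustrated with a small sketch, which is not SAP Data Hub's implementation but shows the underlying idea: record an ordered list of transformations against a sample, then replay the identical list against the full data set. The file and column names are invented.

```python
"""Illustrative recipe sketch using pandas; not SAP Data Hub's code."""
import pandas as pd

# A recipe is just an ordered list of named, replayable transformations.
recipe = [
    ("trim whitespace in City", lambda df: df.assign(City=df["City"].str.strip())),
    ("uppercase Country",       lambda df: df.assign(Country=df["Country"].str.upper())),
    ("drop rows missing Email", lambda df: df.dropna(subset=["Email"])),
]

def apply_recipe(df, recipe):
    """Replay each step in order; every step returns a new frame."""
    for name, step in recipe:
        df = step(df)
    return df

# Design time: iterate quickly on a small sample of a large source.
sample = pd.read_csv("customers.csv", nrows=1_000)
preview = apply_recipe(sample, recipe)

# Execution time: replay the identical recipe on the full data set.
full = apply_recipe(pd.read_csv("customers.csv"), recipe)
full.to_csv("customers_clean.csv", index=False)
```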


Metadata Solutions
Metadata solutions will use a self-learning approach to improve the consistency, accuracy, and completeness of the metadata of all shapes of data, providing active data governance with SAP Data Hub. This will allow you to do the following (a conceptual catalog sketch follows the list):
• Use a unified metadata catalog to gain visibility into landscape-wide data assets.
• Easily govern and manage metadata assets across an enterprise system with disparate sources.
• Discover, understand, and consume information about data, with the ability to synchronize, share, and perform version lineage and impact analysis.
• Answer related information requests without browsing through multiple systems or repositories or touching various data models.
• Support nondomain experts in evaluating data quality and the impact of changes.
• Enable active governance based on risk and policies.
• Use metadata marketplaces to support the monetization of metadata.
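The following conceptual sketch shows how a unified catalog entry with lineage pointers enables the impact analysis described above. All field and function names are invented for illustration and are not taken from SAP Data Hub's metadata model.

```python
"""Conceptual unified-catalog sketch; field names are invented."""
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    qualified_name: str                              # e.g. "s3://lake/raw_orders"
    source_system: str                               # where the asset physically lives
    tags: list = field(default_factory=list)         # curation/governance labels
    upstream: list = field(default_factory=list)     # lineage: assets this one derives from

catalog = {}

def register(entry):
    catalog[entry.qualified_name] = entry

def impact_of(name):
    """Impact analysis: every asset whose lineage includes the given one."""
    return [e.qualified_name for e in catalog.values() if name in e.upstream]

register(CatalogEntry("s3://lake/raw_orders", "S3", tags=["pii:none"]))
register(CatalogEntry("hana://dw/orders_curated", "SAP HANA",
                      tags=["gold"], upstream=["s3://lake/raw_orders"]))
print(impact_of("s3://lake/raw_orders"))  # -> ['hana://dw/orders_curated']
```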

5.3 Integration with SAP Data Intelligence

To support the intelligent enterprise, SAP Data Hub will support data science platforms by combining SAP Predictive Analytics, SAP Leonardo Machine Learning Foundation, SAP HANA machine learning, and open source languages and libraries with complete lifecycle management in an integrated development environment (IDE) that has the following capabilities:
• A holistic data science experience, spanning from data exploration to monitoring productive use via an ML IDE
• Seamless integration of SAP, non-SAP, and third-party data sources
• One central repository for all ML artifacts


SAP Data Intelligence is a unified, integrated solution offering one data science frontend with full lifecycle management to handle AI needs at scale, combining the power of the open source community and SAP's machine learning capabilities with enterprise reliability and scale. The solution extends SAP Data Hub capabilities, such as bridging big data and enterprise data, governing complex modern landscapes, and productizing complex data scenarios, with additional capabilities dedicated to the design and operationalization of AI processes. The three main aspects of SAP Data Intelligence are managing the data, the design, and the delivery.

Manage the Data
To drive successful business innovation based on AI, a company needs to manage its data properly and connect to all of it. The company can then access, transform, and reuse these assets the right way, and tag, curate, and share existing or newly derived information for reusability. Thanks to SAP Data Hub, SAP Data Intelligence offers the ability to access any data source: cloud, on premise, IoT, SAP, or non-SAP. You can automatically index and crawl any available data asset to find what you need with ease. End users can leverage a rich set of operators to transform multiple data connections into a usable data set for AI modeling needs, as well as save and reuse the data sets across the organization. Finally, they can enrich data assets with metadata so that everyone in the organization can find what they need, minimizing unnecessary duplication of data wrangling tasks.

Manage the Design
To drive successful business innovation based on AI, the project design must be managed properly: giving access to the right tools, offering AI as a service, and making model deployments seamless and simple.


Data scientists can use the tools they already know, such as JupyterLab or the pipelining capabilities, as part of the same environment as the main IT platform, so that all personas can collaborate, share, and understand what was built. The solution offers the ability to spin up any lab environment supporting major SAP and open source languages and libraries, so no time is wasted waiting for infrastructure to be ready. Finally, deployment is made simple by promoting lab environments into production environments using simple UI interactions.

Manage the Delivery
To drive successful business innovation based on AI, delivery must be managed properly by automating low-value tasks, managing everything in one place, and reducing infrastructure costs. The solution automates low-value tasks by managing model performance and lifecycle automatically, so that IT and data scientists can focus on the tasks that really matter. A single interface lets you manage everything at once by showing all models across the organization in a unified workspace, with tools to help manage governance, auditability, and transparency. Finally, it reduces infrastructure costs through a serverless cloud architecture with pricing aligned to actual usage.

6 What's Next?

Now that you’ve explored SAP Data Hub, it’s time to learn even more about data provisioning. SAP offers many tools beyond SAP Data Hub for this process, from SDI and SDQ to SAP Data Services and SAP LT Replication Server. It’s time to standardize, integrate, and secure your data!


Recommendation from Our Editors
Looking to provision data for SAP HANA? Get to know your options with Data Provisioning for SAP HANA by Cundiff, Gomes, Lamb, Loden, and Suneja! From SAP Agile Data Preparation to SAP Data Services, you'll learn how and when to use each data provisioning tool. Visit www.sap-press.com/4588 to check out Data Provisioning for SAP HANA!

In addition to this book, our editors picked a few other SAP PRESS publications that you might also be interested in. Check out the next page to learn more!


More from SAP PRESS
SAP Master Data Governance—The Comprehensive Guide to SAP MDG: Build a firm foundation in data, process, and UI modeling. Take your skills to the next level with complete coverage of data quality, search, and consolidation functionality.
627 pages, pub. 08/2017
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/4192

Data Migration with SAP: This comprehensive guide not only leads you through project planning, but also gives you step-by-step instructions for executing your migration with LSMW, SAP Data Services, the batch input technique, and more.
563 pages, 3rd edition, pub. 03/2016
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/4019

SAP Data Services—The Comprehensive Guide: Learn about topics like planning, blueprinting, and integrating SAP Data Services. Get the skills you need for your daily job, from basic tasks like designing objects, to advanced duties like analyzing unstructured text.
524 pages, pub. 02/2015
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/3688


Usage, Service, and Legal Notes

Notes on Usage
This E-Bite is protected by copyright. By purchasing this E-Bite, you have agreed to accept and adhere to the copyrights. You are entitled to use this E-Bite for personal purposes. You may print and copy it, too, but also only for personal use. Sharing an electronic or printed copy with others, however, is not permitted, either as a whole or in part. Of course, making it available on the Internet or in a company network is illegal. For detailed and legally binding usage conditions, please refer to the section Legal Notes.

Service Pages
The following sections contain notes on how you can contact us.

Praise and Criticism
We hope that you enjoyed reading this E-Bite. If it met your expectations, please do recommend it. If you think there is room for improvement, please get in touch with the editor of the book: Meagan White ([email protected]). We welcome every suggestion for improvement but, of course, also any praise! You can also share your reading experience via Twitter, Facebook, or email.


Technical Issues
If you experience technical issues with your e-book or e-book account at SAP PRESS, please feel free to contact our reader service: [email protected].

About Us and Our Program
The website http://www.sap-press.com provides detailed and first-hand information on our current publishing program. Here, you can also easily order all of our books and e-books. Information on Rheinwerk Publishing Inc. and additional contact options can also be found at http://www.sap-press.com.

Legal Notes
This section contains the detailed and legally binding usage conditions for this E-Bite.

Copyright Note
This publication is protected by copyright in its entirety. All usage and exploitation rights are reserved by the author and Rheinwerk Publishing; in particular the right of reproduction and the right of distribution, be it in printed or electronic form.
© 2019 by Rheinwerk Publishing, Inc., Boston (MA)

Your Rights as a User
You are entitled to use this E-Bite for personal purposes only. In particular, you may print the E-Bite for personal use or copy it as long as you store this copy on a device that is solely and personally used by yourself. You are not entitled to any other usage or exploitation.


In particular, it is not permitted to forward electronic or printed copies to third parties. Furthermore, it is not permitted to distribute the E-Bite on the Internet, in intranets, or in any other way or make it available to third parties. Any public exhibition, other publication, or any reproduction of the E-Bite beyond personal use are expressly prohibited. The aforementioned does not only apply to the E-Bite in its entirety but also to parts thereof (e.g., charts, pictures, tables, sections of text). Copyright notes, brands, and other legal reservations as well as the digital watermark may not be removed from the E-Bite.

Digital Watermark
This E-Bite copy contains a digital watermark, a signature that indicates which person may use this copy. If you, dear reader, are not this person, you are violating the copyright. So please refrain from using this E-Bite and inform us about this violation. A brief email to [email protected] is sufficient. Thank you!

Limitation of Liability
Regardless of the care that has been taken in creating texts, figures, and programs, neither the publisher nor the author, editor, or translator assume any legal responsibility or any liability for possible errors and their consequences.


Imprint
This E-Bite is a publication many contributed to, specifically:
Editor Meagan White
Acquisitions Editor Hareem Shafi
Copyeditor Julie McNamee
Cover Design Graham Geary
Layout Design Graham Geary
Production Kelly O'Callaghan
Typesetting SatzPro, Krefeld (Germany)
ISBN 978-1-4932-1754-0
© 2019 by Rheinwerk Publishing, Inc., Boston (MA)
1st edition 2019

All rights reserved. Neither this publication nor any part of it may be copied or reproduced in any form or by any means or translated into another language, without the prior consent of Rheinwerk Publishing, 2 Heritage Drive, Suite 305, Quincy, MA 02171. Rheinwerk Publishing makes no warranties or representations with respect to the content hereof and specifically disclaims any implied warranties of merchantability or fitness for any particular purpose. Rheinwerk Publishing assumes no responsibility for any errors that may appear in this publication. “Rheinwerk Publishing” and the Rheinwerk Publishing logo are registered trademarks of Rheinwerk Verlag GmbH, Bonn, Germany. SAP PRESS is an imprint of Rheinwerk Verlag GmbH and Rheinwerk Publishing, Inc. All of the screenshots and graphics reproduced in this book are subject to copyright © SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany.


SAP, the SAP logo, ABAP, Ariba, ASAP, Concur, Concur ExpenseIt, Concur TripIt, Duet, SAP Adaptive Server Enterprise, SAP Advantage Database Server, SAP Afaria, SAP ArchiveLink, SAP Ariba, SAP Business ByDesign, SAP Business Explorer, SAP BusinessObjects, SAP BusinessObjects Explorer, SAP BusinessObjects Lumira, SAP BusinessObjects Roambi, SAP BusinessObjects Web Intelligence, SAP Business One, SAP Business Workflow, SAP Crystal Reports, SAP EarlyWatch, SAP Exchange Media (SAP XM), SAP Fieldglass, SAP Fiori, SAP Global Trade Services (SAP GTS), SAP GoingLive, SAP HANA, SAP HANA Vora, SAP Hybris, SAP Jam, SAP MaxAttention, SAP MaxDB, SAP NetWeaver, SAP PartnerEdge, SAPPHIRE NOW, SAP PowerBuilder, SAP PowerDesigner, SAP R/2, SAP R/3, SAP Replication Server, SAP S/4HANA, SAP SQL Anywhere, SAP Strategic Enterprise Management (SAP SEM), SAP SuccessFactors, The Best-Run Businesses Run SAP, TwoGo are registered or unregistered trademarks of SAP SE, Walldorf, Germany. All other products mentioned in this book are registered or unregistered trademarks of their respective companies.