[Big Data] DataX Introduction

DataX Getting Started


1.1 Introduction

DataX is an offline synchronization tool for heterogeneous data sources that is widely used within Alibaba Group. It provides stable and efficient data synchronization between heterogeneous data sources, including relational databases (MySQL, Oracle, and others), HDFS, Hive, MaxCompute (formerly ODPS), HBase, and FTP.

DataX is an offline data synchronization framework built on a Framework + plugin architecture. Reading from and writing to a data source are abstracted into Reader and Writer plugins that plug into the synchronization framework. It now has a fairly comprehensive plugin ecosystem that covers mainstream RDBMS databases, NoSQL stores, and big data computing systems.

DataX currently supports the following data sources:

1.2 Design Idea

To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link. DataX acts as the intermediate transport carrier and is responsible for connecting the various data sources. When a new data source needs to be added, you only need to connect it to DataX, and it can then synchronize data seamlessly with the existing data sources.
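A rough count shows why the star layout scales better. The sketch below is only an illustration and is not part of DataX; the function names are hypothetical. It assumes a mesh needs one dedicated link per ordered (source, target) pair, while a star needs one Reader and one Writer per data source.

```python
def mesh_links(n: int) -> int:
    """Mesh: one dedicated sync link per ordered (source, target) pair."""
    return n * (n - 1)


def star_plugins(n: int) -> int:
    """Star: each data source needs one Reader and one Writer plugin."""
    return 2 * n


for n in (3, 5, 10):
    print(f"{n} sources: mesh={mesh_links(n)} links, star={star_plugins(n)} plugins")
```

Under these assumptions, adding an 11th data source to a mesh of 10 means 20 new links, whereas the star only needs 2 new plugins.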

1.3 Framework Design

As an offline data synchronization framework, DataX is built on a Framework + plugin architecture. Reading from and writing to a data source are abstracted into Reader and Writer plugins that plug into the synchronization framework.

Reader: Reader is a data acquisition module that collects data from the data source and sends the data to the Framework.

Writer: Writer is a data writing module that is responsible for continuously fetching data from the Framework and writing the data to the destination.

Framework: Framework connects the Reader and the Writer, serves as the data transmission channel between them, and handles core technical issues such as buffering, flow control, concurrency, and data conversion.

The open source version of DataX 3.0 runs synchronization jobs in stand-alone, multi-threaded mode.
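The division of labor between Reader, Writer, and Framework can be sketched in a few lines. This is a minimal illustration in Python, not DataX's actual plugin API (DataX itself is written in Java); the names `reader`, `writer`, and `run_job` are hypothetical, and a bounded queue stands in for the Framework's buffered channel.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the record stream


def reader(source, channel):
    """Reader: collect records from the source and hand them to the framework."""
    for record in source:
        channel.put(record)  # blocks while the channel is full (buffering)
    channel.put(_DONE)


def writer(channel, sink):
    """Writer: keep fetching records from the framework and write them out."""
    while True:
        record = channel.get()
        if record is _DONE:
            break
        sink.append(record)


def run_job(source, sink, capacity=2):
    """Framework: connect one reader and one writer through a bounded channel."""
    channel = queue.Queue(maxsize=capacity)
    threads = [
        threading.Thread(target=reader, args=(source, channel)),
        threading.Thread(target=writer, args=(channel, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


rows = [("1", "alice"), ("2", "bob"), ("3", "carol")]
out = []
run_job(rows, out)
print(out)
```

Because the channel is bounded, a fast reader is forced to wait for a slow writer, which is the simplest form of the buffering and flow control that the Framework is described as handling.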

1.4 Advantages

1. Reliable data quality monitoring (data is delivered to the destination intact)

2. Rich data conversion functions

3. Precise speed control

DataX 3.0 provides three flow control modes: channel (concurrency), record stream, and byte stream. You can throttle a job as needed so that it reaches the best synchronization speed the database can tolerate.

4. Strong synchronization performance

Each type of plugin has one or more splitting strategies that divide a job into multiple tasks executed in parallel. With the single-machine, multi-threaded execution model, DataX's speed increases linearly with concurrency.

5. Robust fault tolerance (multi-level local and global retries)

6. Minimalist user experience

Ready to use after download, with detailed log information.
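The idea of splitting a job into parallel tasks can be illustrated with a simple range split. The sketch below is an assumption-laden example, not DataX's own splitting code: it assumes the source table has an integer key, and the helpers `split_range` and `copy_range` are hypothetical stand-ins for a plugin's splitting strategy and for a single task.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(low, high, parts):
    """Split the inclusive key range [low, high] into up to `parts` contiguous sub-ranges."""
    if high < low:
        raise ValueError("high must be >= low")
    total = high - low + 1
    parts = max(1, min(parts, total))
    base, extra = divmod(total, parts)
    ranges, start = [], low
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def copy_range(bounds):
    """Stand-in for one task: pretend to sync the rows whose key falls in `bounds`."""
    lo, hi = bounds
    return hi - lo + 1  # number of rows "synchronized"


tasks = split_range(1, 10, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    copied = list(pool.map(copy_range, tasks))

print(tasks)
print(sum(copied))
```

Each sub-range is independent of the others, which is what allows the tasks to run concurrently on separate threads.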

2. Related Concepts

Heterogeneous Data Sources

Heterogeneous data sources are data stored in different database management systems. As an enterprise builds up its IT systems, data management systems are constructed and rolled out in phases and are shaped by technical, economic, and human factors, so the enterprise accumulates a large amount of business data stored in different ways. The data management systems involved also vary widely, from simple file databases to complex network databases, and together they make up the enterprise's heterogeneous data sources.

Enterprise data source heterogeneity is mainly manifested in three aspects:

1. System heterogeneity: the business application systems, database management systems, and operating systems that the data sources depend on differ from one another, which results in different system architectures.

2. Schema heterogeneity: the storage models of the data sources differ. Storage models mainly include the relational model, the object model, the object-relational model, and the nested document model, of which the relational model (relational databases) is the mainstream. Even within the same storage model, the schema structure may differ. For example, the data types of different relational database management systems, such as DB2, Oracle, Sybase, Informix, SQL Server, and FoxPro, are not fully consistent with one another.

3. Source heterogeneity: the heterogeneity between internal data sources and external data sources.

3. Build and Test DataX

After running the demo job, the result is as follows:


The results in the figure are consistent with how DataX works, which helps deepen the understanding of it. The task's execution time and other metrics are also listed clearly in the result. The next step is to start writing job configurations for use in the project.
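As a starting point for that configuration work, the sketch below builds a small stream-to-stream job description and prints it as JSON. The key names (`job.setting.speed.channel`, `streamreader`, `streamwriter`, `sliceRecordCount`, `print`) follow the DataX quick-start sample as best recalled and should be treated as assumptions; check them against the documentation of the DataX version you installed.

```python
import json

job = {
    "job": {
        "setting": {
            # Flow control by channel (concurrency); key name assumed from the quick start
            "speed": {"channel": 2},
        },
        "content": [
            {
                # Reader that generates constant rows in memory (assumed plugin name)
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [
                            {"type": "long", "value": "10"},
                            {"type": "string", "value": "hello, DataX"},
                        ],
                        "sliceRecordCount": 5,
                    },
                },
                # Writer that prints each record to the console (assumed plugin name)
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"encoding": "UTF-8", "print": True},
                },
            }
        ],
    }
}

job_json = json.dumps(job, indent=2)
print(job_json)
```

Saved to a file, a job like this would typically be passed to the launcher script as `python bin/datax.py <path to the job file>` from the DataX installation directory, assuming the standard layout of the release package.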