Since Hive uses the SQL query language HQL, it is easy to understand Hive as a database. In fact, from a structural point of view, Hive and the database have no similarities except for a similar query language. The   database can be used in online applications, but Hive is designed for data warehousing. This is clear to help you understand Hive's features from an application perspective.

一, Hive is a data warehouse platform based on Hadoop. With hive, we can easily work with ETL. Hive defines a SQL-like query language: HQL, which converts user-written QL into a corresponding Mapreduce program based on Hadoop.

Hive is a data warehouse framework that Facebook just opened in August 2008. Its system goal is similar to Pig, but it has some mechanisms that Pig does not currently support, such as: richer type system, more SQL-like query language, persistence of Table/Partition metadata, etc.

Hive can be seen as the data from the SQL to Map-Reduce mapper


Data in the HDFS warehouse directory, a table corresponds to a subdirectory. The

local/tmp directory stores logs and execution plans

hive. The tables are divided into two types, internal and external. When Hive creates an internal table, it moves the data to the path pointed to by the data warehouse; if an external table is created, only the path where the data is located is recorded, and no changes are made to the location of the data. When the table is deleted, the metadata and data of the internal table are deleted together, and the external table only deletes the metadata and does not delete the data. This makes the external table relatively safer, and the data organization is more flexible, making it easy to share source data.

Use Mysql as the storage database of Hive metaStore

The main tables involved are as follows:

Let's look at their similarities and differences.

  1. Query language. Because SQL is widely used in data warehouses, the SQL-like query language HQL is designed specifically for Hive's features. Developers familiar with SQL development can easily use Hive for development.

  2. Data storage location. Hive is built on top of Hadoop, and all Hive data is stored in HDFS. The database can save the data in the local file system.

  3. Data Format. There is no specific data format defined in Hive. The data format can be specified by the user. The user-defined data format needs to specify three attributes: column separator (usually space, "\t", "\x001"), line separator (" \n") and the method of reading file data (there are three file formats TextFile, SequenceFile and RCFile by default in Hive). Since the data format is not converted from the user data format to the Hive-defined data format during the loading process, Hive does not modify the data itself during the loading process, but simply copies or moves the data content to the corresponding In the HDFS directory. In the database, different databases have different storage engines and define their own data formats. All data is stored in a certain organization, so the process of loading data into the database can be time consuming.

  4. Data Update. Because Hive is designed for data warehousing applications, the content of the data warehouse is read and write less. Therefore, Hive does not support rewriting and adding data, all data is determined at the time of loading. The data in the database usually needs to be modified frequently, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify the data.

  5. index. As mentioned before, Hive does not process any data during the process of loading data, and does not even scan the data, so it does not index some Keys in the data. When Hive wants to access a specific value in a data that satisfies the condition, it needs to scan the entire data violently, so the access latency is high. Due to the introduction of MapReduce, Hive can access data in parallel, so even without indexing, Hive can still show advantages for large data access. In the database, one or several columns are usually indexed, so for a small amount of access to specific conditions, the database can be highly efficient and has low latency. Due to the high latency of data access, Hive is not suitable for online data queries.

  6. carried out. The execution of most queries in Hive is implemented through MapReduce provided by Hadoop (similar to select * from tbl queries do not require MapReduce). Databases usually have their own execution engine.

  7. Execution delay. As mentioned before, when Hive queries data, because there is no index, the entire table needs to be scanned, so the delay is high. Another factor that causes Hive execution latency is the MapReduce framework. Due to the high latency of MapReduce itself, there is a high latency when performing Hive queries with MapReduce. In contrast, database execution latency is low. Of course, this low is conditional, that is, the data size is small. When the data size is large enough to exceed the processing power of the database, Hive's parallel computing clearly shows its advantages.

  8. Scalability. Since Hive is built on top of Hadoop, Hive's scalability is consistent with Hadoop's scalability (the world's largest Hadoop cluster is in Yahoo!, with a scale of around 4000 nodes in 2009). The database is very limited due to the strict limitation of ACID semantics. At present, the most advanced parallel database Oracle has a theoretical expansion capacity of only about 100 units.

  9. 数据尺寸. Because Hive is built on a cluster and can use MapReduce for parallel computing, it can support large-scale data; correspondingly, the database can support smaller data sizes.

Look at these, I said why the hive query data is so slow.

Finally, let's take a look at the difference between database and data warehousing. The

  1. database is a transaction-oriented design, and the data warehouse is designed for themes. The database generally stores online transaction data, and the data warehouse stores generally historical data.
  2. Database design is to avoid redundancy as much as possible. Generally, it is designed according to the rules of the paradigm. The data warehouse is designed to introduce redundancy and adopt anti-paradigm design. The
  3. database is designed to capture data. The data warehouse is designed to analyze data. Its two basic elements are dimension tables and fact tables. (Dimensions are the perspective of the problem, such as time, department, dimension table is the definition of these things, the fact table contains the data to be queried, and there is a dimension ID)