Talking about Big Data – Spark on Hive & Hive on Spark

Table of contents

Difference between Spark on Hive and Hive on Spark

Hive query process and principle

Hive converts SQL to MapReduce, which executes slowly

Hive On Spark Optimization

Features of the Hive metadata database

Starting the Hive MetaStore service

Difference between Spark on Hive and Hive on Spark

  • Spark on Hive

Spark on Hive means that Spark operates on Hive through Spark SQL using Hive (HQL) statements, while the bottom layer is still Spark RDDs.

(1) Spark SQL loads the Hive configuration file and obtains Hive's metadata information.

(2) With Hive's metadata information, Spark SQL can locate the data of all of Hive's tables.

(3) Spark SQL can then manipulate the data in the Hive tables directly, as the sketch below shows.
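
A minimal sketch of steps (1)–(3) in Scala (for example, pasted into spark-shell), assuming a MetaStore reachable at thrift://node01:9083 (configured later in this article) and a hypothetical table demo_db.person:

import org.apache.spark.sql.SparkSession

// (1) point Spark SQL at Hive's metadata; enableHiveSupport() also picks up
// the hive-site.xml placed in Spark's conf directory
val spark = SparkSession.builder()
  .appName("SparkOnHive")
  .config("hive.metastore.uris", "thrift://node01:9083") // assumed address
  .enableHiveSupport()
  .getOrCreate()

// (2) with the metadata loaded, Hive's databases and tables are visible
spark.sql("show databases").show()

// (3) manipulate the data in a Hive table; execution is still Spark RDD jobs
spark.sql("select * from demo_db.person").show() // demo_db.person is hypothetical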

  • Hive on Spark

Hive on Spark replaces Hive's query execution: instead of being compiled to MapReduce (the Hadoop computing engine), queries run as Spark RDD operations (the Spark execution engine). Compared with Spark on Hive, this is considerably more troublesome to set up: you must compile a compatible build of Spark yourself and import its jar packages into Hive. In practice, most deployments currently use Spark on Hive.
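
Once such a build is in place, switching Hive's execution engine is a single configuration property, shown here per session (it can also be set in hive-site.xml); the queried table is hypothetical:

set hive.execution.engine=spark;
-- subsequent queries in this session now run as Spark jobs instead of MapReduce
select count(*) from person;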

For the most authoritative explanation, see the Apache Spark official website:

http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html

Hive query process and principle

When executing HQL, Hive first looks up the table's description information in the MySQL metadata database, then parses the HQL and generates MapReduce tasks based on that description information.
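
You can see this translation with Hive's EXPLAIN command, whose output lists the generated MapReduce stages; the table and column names here are hypothetical:

EXPLAIN SELECT dept, COUNT(*) FROM person GROUP BY dept;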

Hive converts SQL to MapReduce, which executes slowly

Using Spark SQL to integrate Hive actually means letting Spark SQL load Hive's metadata database and then using the Spark SQL execution engine to operate on the data in the Hive tables.

First, you need to start Hive's metadata database (MetaStore) service so that Spark SQL can load the metadata; a quick smoke test follows below.
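
Once the MetaStore service is running and hive-site.xml is in Spark's conf directory (both steps are shown later in this article), the spark-sql shell can query Hive directly through the Spark engine; the install path follows this article's layout and is an assumption:

/export/servers/spark/bin/spark-sql -e "show databases;"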

Hive On Spark Optimization

Hive's original computing model is MR, which is inefficient because of its frequent disk operations (intermediate results are written to HDFS). Hive On Spark instead uses RDDs (DataFrames) and runs on the Spark cluster. The metadata is stored in MySQL and contains the description information of the Hive tables: which databases and tables exist, how many columns each table has, the type of each column, and where in HDFS each table's data is stored.
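
You can inspect this description information directly in MySQL. A minimal sketch against the default metastore schema, assuming the metastore database is named hive (table names can vary across Hive versions):

USE hive;  -- assumed metastore database name
-- list every Hive table together with its database and HDFS location
SELECT d.NAME AS db_name, t.TBL_NAME, s.LOCATION
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID;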

Features of the Hive metadata database

Hive's metadata establishes the mapping relationship between tables and data. To execute HQL, the engine first looks up the description information in the MySQL metadata database, generates tasks from that description, and then sends the tasks to the Spark cluster for execution. Spark on Hive only uses Hive's standards and specifications; it can be used without a Hive installation, and not having the Hive components installed has no effect. To use the Hive standard, you need to put the Hive configuration file in Spark's conf directory, for example as shown below.
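
For example, copying the configuration into place (the directory layout follows this article's install paths and is an assumption):

cp /export/servers/hive/conf/hive-site.xml /export/servers/spark/conf/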

Starting the Hive MetaStore service

Modify hive/conf/hive-site.xml and add the following configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>
    </property>
    <property>
      <name>hive.metastore.local</name>
      <value>false</value>
    </property>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://node01:9083</value>
    </property>
 </configuration>

Start the Hive MetaStore service in the background

nohup /export/servers/hive/bin/hive --service metastore >> /var/log.log 2>&1 &
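
To verify that the MetaStore service is listening on port 9083 (as configured above), one quick check (command availability varies by system):

netstat -nltp | grep 9083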
