Atlas Summary
The metadata management framework Apache Atlas
In-depth analysis
Official docs: http://atlas.apache.org/2.1.0/index.html#/
Atlas Developer Guide (Chinese): https://mantoudev.com/mantouBook/Atlas_cn/
Introduction
https://www.cnblogs.com/mantoudev/p/9986408.html
The Atlas data model (important to understand; we will define custom types later)
The Atlas Glossary
https://www.cnblogs.com/mantoudev/p/9965869.html
Glossary-related operation APIs in atlas-webapp
webapp/src/main/java/org/apache/atlas/web/rest/GlossaryREST.java
Atlas's metadata model: the Type System
https://www.cnblogs.com/mantoudev/p/9985600.html
https://cloud.tencent.com/developer/article/1503998
Atlas allows users to define a model for the metadata objects they want to manage.
The model is composed of definitions called types. Instances of types, called entities, represent the actual metadata objects under management.
The Type System is the component that lets users define and manage types and entities.
All metadata objects managed by Atlas out of the box (for example, Hive tables) are modeled with types and represented as entities.
To store new kinds of metadata in Atlas, you need to understand the concepts of the Type System component.
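To make the type/entity distinction concrete, here is a minimal sketch of registering a custom type through the v2 Java client. This is our own illustration rather than an Atlas example: the type name demo_table, the retentionDays attribute, and the URL/credentials are made up.

import java.util.Collections;

import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.AtlasServiceException;
import org.apache.atlas.model.typedef.AtlasEntityDef;
import org.apache.atlas.model.typedef.AtlasStructDef.AtlasAttributeDef;
import org.apache.atlas.model.typedef.AtlasTypesDef;

public class CustomTypeDemo {
    public static void main(String[] args) throws AtlasServiceException {
        // Basic-auth client pointing at the Atlas web app (atlas.rest.address)
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://localhost:21000"}, new String[]{"admin", "admin"});

        // A custom entity type inheriting the built-in DataSet super type
        AtlasEntityDef entityDef = new AtlasEntityDef("demo_table");
        entityDef.setSuperTypes(Collections.singleton("DataSet"));
        entityDef.addAttribute(new AtlasAttributeDef("retentionDays", "int"));

        AtlasTypesDef typesDef = new AtlasTypesDef();
        typesDef.getEntityDefs().add(entityDef);

        // Registers the type; entities of demo_table can be created afterwards
        client.createAtlasTypeDefs(typesDef);
    }
}

Once the type is registered, each concrete demo_table is represented as an entity, which is exactly the relationship described above.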
Atlas metadata and index storage
Atlas uses JanusGraph to store and manage metadata.
By default, Atlas uses a standalone HBase instance as JanusGraph's underlying storage.
Atlas indexes metadata through JanusGraph to support full-text search queries.
To provide HA for the index store, we recommend configuring Atlas to use Solr or Elasticsearch as JanusGraph's index backend.
Graph database engine: JanusGraph
Graph storage backend: HBase / Cassandra
Graph index backend: Solr / Elasticsearch
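For reference, the storage and index backends are selected in atlas-application.properties. A minimal sketch matching the hbase2/elasticsearch combination that appears in the metastore logs later in this document; hostnames and ports are placeholders:

# Graph storage backend (JanusGraph on HBase); the hostname is the ZK quorum (placeholder)
atlas.graph.storage.backend=hbase2
atlas.graph.storage.hostname=hadoop04:2181
# Graph index backend: 'solr' or 'elasticsearch' is recommended for HA
atlas.graph.index.search.backend=elasticsearch
atlas.graph.index.search.hostname=hadoop04:9200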
Source code module structure analysis
https://www.cnblogs.com/wang3680/p/13968277.html
Practice notes from others
Atlas 2.1.0 in practice (1): Building Atlas
https://cloud.tencent.com/developer/article/1764110
Atlas 2.1.0 in practice (2): Installing Atlas
https://cloud.tencent.com/developer/article/1768539?from=article.detail.1764110
Apache Atlas 1.2.0 deployment guide (using the cluster's existing HBase and ElasticSearch rather than the embedded HBase and Solr)
https://blog.csdn.net/xueyao0201/article/details/94310199
Atlas 2.1.0 in practice (3): Integrating Atlas with Hive
https://cloud.tencent.com/developer/article/1781542
Atlas 2.1.0 in practice (4): Access control
https://cloud.tencent.com/developer/article/1785134
Metamodel
Overview
Atlas organizes all metadata objects with a Type/Entity model; the two relate like Class/Instance in OOP.
Types fall into several metatypes: Enum / Collection (Array, Map) / Composite (Entity, Struct, Classification, Relationship).
A Composite type can have multiple attributes, and an attribute can point to another metatype, which enables rich relationships.
Interestingly, Entity and Classification types support inheritance.
What actually holds the metadata is an Entity, for example a Hive table.
Source code analysis
The main concepts are:
Type
Entity
Attributes
AtlasType: intg/src/main/java/org/apache/atlas/type/AtlasType.java
TypeCategory: intg/src/main/java/org/apache/atlas/model/TypeCategory.java
    PRIMITIVE, OBJECT_ID_TYPE, ARRAY, MAP, ENUM, STRUCT,
    CLASSIFICATION, ENTITY, RELATIONSHIP, BUSINESS_METADATA
SuperTypes:
    Asset, DataSet, Process, Referenceable

intg/src/main/java/org/apache/atlas/model/typedef/AtlasBaseTypeDef.java (abstract base typedef)
    intg/src/main/java/org/apache/atlas/model/typedef/AtlasEnumDef.java (enum typedef)
        private List<AtlasEnumElementDef> elementDefs;
    intg/src/main/java/org/apache/atlas/model/typedef/AtlasStructDef.java (struct typedef)
        private List<AtlasAttributeDef> attributeDefs;
        intg/src/main/java/org/apache/atlas/model/typedef/AtlasClassificationDef.java (classification typedef)
        intg/src/main/java/org/apache/atlas/model/typedef/AtlasRelationshipDef.java (relationship typedef)
            intg/src/main/java/org/apache/atlas/model/typedef/AtlasRelationshipEndDef.java
        intg/src/main/java/org/apache/atlas/model/typedef/AtlasEntityDef.java (entity typedef)
            private List<AtlasRelationshipAttributeDef> relationshipAttributeDefs;
            private Map<String, List<AtlasAttributeDef>> businessAttributeDefs;
        intg/src/main/java/org/apache/atlas/model/typedef/AtlasBusinessMetadataDef.java (business metadata typedef)

intg/src/main/java/org/apache/atlas/model/typedef/AtlasTypesDef.java (container for all typedefs)
    private List<AtlasEnumDef> enumDefs;
    private List<AtlasStructDef> structDefs;
    private List<AtlasClassificationDef> classificationDefs;
    private List<AtlasEntityDef> entityDefs;
    private List<AtlasRelationshipDef> relationshipDefs;
    private List<AtlasBusinessMetadataDef> businessMetadataDefs;
Worked examples
Hive
addons/hive-bridge/src/main/java/org/apache/atlas/hive/model/HiveDataTypes.java
addons/models/1000-Hadoop/1030-hive_model.json
docs/src/documents/Hook/HookHive.md
Kafka
addons/kafka-bridge/src/main/java/org/apache/atlas/kafka/model/KafkaDataTypes.java
addons/models/1000-Hadoop/1070-kafka_model.json
docs/src/documents/Hook/HookKafka.md
Sqoop
addons/sqoop-bridge/src/main/java/org/apache/atlas/sqoop/model/SqoopDataTypes.java
addons/models/1000-Hadoop/1040-sqoop_model.json
docs/src/documents/Hook/HookSqoop.md
How to extend the model with custom types
https://atlas.apache.org/2.1.0/index.html#/TypeSystem
https://www.cnblogs.com/163yun/p/9015985.html
https://www.cnblogs.com/mantoudev/p/9985600.html
https://blog.csdn.net/rlnLo2pNEfx9c/article/details/106846113
Metadata integration: importing demo data offline
Main goals of this example
(1) Show how to create custom metadata offline, covering both the type system and concrete entities.
(2) Show how to query metadata entities and their lineage.
Entry point
python bin/quick_start.py
Python source
distro/src/bin/quick_start.py
Java main class
webapp/src/main/java/org/apache/atlas/examples/QuickStartV2.java
    Builds the metadata, then sends it to the Atlas server through the REST API.
REST address:
    ######### Server Properties #########
    atlas.rest.address=http://localhost:21000
Java client utility class:
    client/client-v2/src/main/java/org/apache/atlas/AtlasClientV2.java
Main logic
Creating the type system
// Shows how to create v2 types in Atlas for your meta model
quickStartV2.createTypes();
Core logic:
AtlasTypesDef atlasTypesDef = createTypeDefinitions();
atlasClientV2.createAtlasTypeDefs(atlasTypesDef);
Creating entities (instances) of the types
// Shows how to create v2 entities (instances) for the added types in Atlas
quickStartV2.createEntities();
Core logic:
Build an AtlasEntity instance for each type:
    intg/src/main/java/org/apache/atlas/model/instance/AtlasEntity.java
Then send the request through AtlasClientV2:
    EntityMutationResponse response = atlasClientV2.createEntity(entityWithExtInfo);
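A minimal sketch of building and submitting one such entity (the demo_table type, the attribute values, and the URL/credentials are made-up examples carried over from the typedef sketch earlier):

import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.AtlasServiceException;
import org.apache.atlas.model.instance.AtlasEntity;
import org.apache.atlas.model.instance.AtlasEntity.AtlasEntityWithExtInfo;
import org.apache.atlas.model.instance.EntityMutationResponse;

public class CreateEntityDemo {
    public static void main(String[] args) throws AtlasServiceException {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://localhost:21000"}, new String[]{"admin", "admin"});

        // An instance of the hypothetical demo_table type registered earlier
        AtlasEntity entity = new AtlasEntity("demo_table");
        entity.setAttribute("name", "sales_daily");
        // qualifiedName is the unique attribute inherited from Referenceable
        entity.setAttribute("qualifiedName", "demo.sales_daily@cluster1");
        entity.setAttribute("retentionDays", 30);

        EntityMutationResponse response = client.createEntity(new AtlasEntityWithExtInfo(entity));
        System.out.println(response.getCreatedEntities());
    }
}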
Demonstrating DSL queries
// Shows some search queries using DSL based on types
quickStartV2.search();
Core logic:
AtlasSearchResult results = atlasClientV2.dslSearchWithParams(dslQuery, 10, 0);
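A sketch of what such DSL calls can look like; the query strings follow the documented DSL grammar, but the table and database names are made up:

import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.AtlasServiceException;
import org.apache.atlas.model.discovery.AtlasSearchResult;

public class DslSearchDemo {
    public static void main(String[] args) throws AtlasServiceException {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://localhost:21000"}, new String[]{"admin", "admin"});

        // DSL reads like SQL over types; limit 10, offset 0 as in QuickStartV2
        String[] queries = {
                "hive_table",                                  // all Hive tables
                "hive_table where name = 'sales_fact'",        // filter on an attribute
                "hive_db where name = 'default' select owner"  // projection
        };
        for (String dsl : queries) {
            AtlasSearchResult result = client.dslSearchWithParams(dsl, 10, 0);
            System.out.println(dsl + " -> " + result);
        }
    }
}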
Demonstrating lineage queries on an entity
// Shows some lineage information on entity
quickStartV2.lineage();
Core logic:
AtlasLineageInfo lineageInfo =
    atlasClientV2.getLineageInfo(getTableId(SALES_FACT_DAILY_MV_TABLE), LineageDirection.BOTH, 0);
Hive integration testing
Configuring the Atlas Hive integration
Copy atlas-application.properties to the conf directory of the Hive client
Copy the Hive hook directories produced by the Atlas build to the Hive client
The path after building the Atlas project myself:
D:\workspace\idea\atlas\distro\target\apache-atlas-2.1.0-hive-hook\apache-atlas-hive-hook-2.1.0
Modify hive-site.xml
<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>

<property>
    <name>hive.metastore.event.listeners</name>
    <value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>
Add the hook jars to the environment by modifying hive-env.sh
export HIVE_AUX_JARS_PATH=/usr/local/hive/hook/hive
Hive client environment
[root@hadoop01 ~]# which hive
/usr/local/hive/bin/hive
[root@hadoop01 ~]# cd /usr/local/hive/
[root@hadoop01 hive]# pwd
/usr/local/hive
[root@hadoop01 hive]# ll
total 11044
drwxr-xr-x 3 root root 179 Aug 6 13:55 bin
drwxr-xr-x 2 root root 4096 Sep 2 20:35 conf
drwxr-xr-x 4 root root 34 Mar 24 16:10 examples
drwxr-xr-x 7 root root 68 Mar 24 16:10 hcatalog
drwxr-xr-x 3 root root 18 Sep 3 09:59 hook
drwxr-xr-x 2 root root 28 Sep 3 09:59 hook-bin
-rw-r--r-- 1 root root 2040 Sep 3 10:04 hook-bin.zip
-rw-r--r-- 1 root root 11251678 Sep 3 10:04 hook.zip
[root@hadoop01 hive]# ll hook-bin
total 8
-rw-r--r-- 1 root root 4246 Aug 19 10:38 import-hive.sh
[root@hadoop01 hive]# ll hook
total 0
drwxr-xr-x 3 root root 112 Sep 3 09:59 hive
[root@hadoop01 hive]# ll hook/hive/
total 36
drwxr-xr-x 2 root root 4096 Sep 3 09:59 atlas-hive-plugin-impl
-rw-r--r-- 1 root root 17506 Sep 3 09:58 atlas-plugin-classloader-2.1.0.jar
-rw-r--r-- 1 root root 11563 Sep 3 09:58 hive-bridge-shim-2.1.0.jar
[root@hadoop01 hive]# ll hook/hive/atlas-hive-plugin-impl/
total 12260
-rw-r--r-- 1 root root 37495 Sep 3 09:51 atlas-client-common-2.1.0.jar
-rw-r--r-- 1 root root 42189 Sep 3 09:51 atlas-client-v1-2.1.0.jar
-rw-r--r-- 1 root root 22362 Sep 3 09:51 atlas-client-v2-2.1.0.jar
-rw-r--r-- 1 root root 79688 Sep 3 09:51 atlas-common-2.1.0.jar
-rw-r--r-- 1 root root 559518 Sep 3 09:51 atlas-intg-2.1.0.jar
-rw-r--r-- 1 root root 64144 Sep 3 09:51 atlas-notification-2.1.0.jar
-rw-r--r-- 1 root root 362679 Jun 30 14:04 commons-configuration-1.10.jar
-rw-r--r-- 1 root root 96551 Sep 3 09:58 hive-bridge-2.1.0.jar
-rw-r--r-- 1 root root 66897 Jul 29 14:41 jackson-annotations-2.9.9.jar
-rw-r--r-- 1 root root 325632 Jul 29 14:41 jackson-core-2.9.9.jar
-rw-r--r-- 1 root root 1400944 Jun 15 21:47 jackson-databind-2.10.0.jar
-rw-r--r-- 1 root root 165345 Jun 15 22:02 jersey-json-1.19.jar
-rw-r--r-- 1 root root 53275 Jul 29 14:40 jersey-multipart-1.19.jar
-rw-r--r-- 1 root root 45927 Jul 29 19:18 jsr311-api-1.1.jar
-rw-r--r-- 1 root root 7295202 Jul 29 14:40 kafka_2.11-2.0.0.jar
-rw-r--r-- 1 root root 1893564 Jul 29 14:40 kafka-clients-2.0.0.jar
[root@hadoop01 hive]# cat conf/hive-env.sh | grep -i HIVE_AUX_JARS_PATH
export HIVE_AUX_JARS_PATH=/usr/local/hive/hook/hive
[root@hadoop01 hive]# cat conf/hive-site.xml | grep -C3 -i atlas
<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>

<property>
    <name>hive.metastore.event.listeners</name>
    <value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>
Offline import of Hive databases and tables
hook-bin/import-hive.sh
Real-time Hive hook: the Hive driver side
Start the Hive client
[root@hadoop01 hive]# hive
hive> drop database db0903;
......
2021-09-03 10:40:49,917 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:compile(554)) - Compiling command(queryId=root_20210903104049_5a3a1ef1-f59b-45e6-a92c-76792707227b): drop database db0903
2021-09-03 10:40:49,933 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] hook.HiveHook (HiveHook.java:<init>(177)) - 222==============================
2021-09-03 10:40:49,934 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:checkConcurrency(285)) - Concurrency mode is disabled, not creating a lock manager
2021-09-03 10:40:49,966 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:compile(666)) - Semantic Analysis Completed (retrial = false)
2021-09-03 10:40:49,966 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:getSchema(374)) - Returning Hive schema: Schema(fieldSchemas:null, properties:null)
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:compile(781)) - Completed compiling command(queryId=root_20210903104049_5a3a1ef1-f59b-45e6-a92c-76792707227b); Time taken: 0.05 seconds
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] reexec.ReExecDriver (ReExecDriver.java:run(156)) - Execution #1 of query
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:checkConcurrency(285)) - Concurrency mode is disabled, not creating a lock manager
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:execute(2255)) - Executing command(queryId=root_20210903104049_5a3a1ef1-f59b-45e6-a92c-76792707227b): drop database db0903
2021-09-03 10:40:49,967 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] ql.Driver (Driver.java:launchTask(2662)) - Starting task [Stage-0:DDL] in serial mode
2021-09-03 10:40:50,640 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] hook.HiveHook (HiveHook.java:run(186)) - 222==============================queryStr:drop database db0903
2021-09-03 10:40:50,640 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] hook.HiveHook (HiveHook.java:run(187)) - 222==============================LINKIS.SUBMIT.USER:null
2021-09-03 10:40:50,641 INFO [cdf9444a-28da-4c7f-9edd-0c73fc26a753 main] hook.HiveHook (HiveHook.java:run(188)) - 222==============================LINKIS.TASK.NAME:null
[root@hadoop01 hive]# ps -ef | grep -i hive
root 21130 5116 2 10:30 pts/3 00:00:08 /usr/jdk1.8.0_191/bin/java -Dproc_jar -Djava.net.preferIPv4Stack=true -Dproc_hivecli -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/usr/local/hive/conf/parquet-logging.properties -Dyarn.log.dir=/usr/local/hadoop-3.2.1/logs -Dyarn.log.file=hadoop.log -Dyarn.home.dir=/usr/local/hadoop-3.2.1 -Dyarn.root.logger=INFO,console -Djava.library.path=/usr/local/hadoop-3.2.1/lib/native -Xmx256m -Dhadoop.log.dir=/usr/local/hadoop-3.2.1/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/local/hadoop-3.2.1 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/local/hive/lib/hive-cli-3.1.2.jar org.apache.hadoop.hive.cli.CliDriver --hiveconf hive.aux.jars.path=file:///home/usr_local/hive/hook/hive/atlas-plugin-classloader-2.1.0.jar,file:///home/usr_local/hive/hook/hive/hive-bridge-shim-2.1.0.jar
Real-time Hive hook: the HiveServer2 side
Real-time Hive hook: the Hive metastore server side
After the configuration is in place, restart the metastore service; the startup log shows the relevant config files and Atlas classes being loaded.
[root@hadoop01 hook]# hive --service metastore
......
2021-09-03 09:06:02,086 INFO [main] conf.MetastoreConf (MetastoreConf.java:findConfigFile(1240)) - Found configuration file file:/home/usr_local/hive/conf/hive-site.xml
2021-09-03 09:06:03,040 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:startupShutdownMessage(9236)) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting HiveMetaStore
STARTUP_MSG: host = hadoop01/172.24.2.232
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.1.2
STARTUP_MSG: classpath = /usr/local/hive/conf:/home/usr_local/hive/hook/hive/atlas-plugin-classloader-2.1.0.jar:/home/usr_local/hive/hook/hive/hive-bridge-shim-2.1.0.jar:/
......
2021-09-03 09:06:05,550 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:get(121)) - Looking for atlas-application.properties in classpath
2021-09-03 09:06:05,550 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:get(134)) - Loading atlas-application.properties from file:/home/usr_local/hive/conf/atlas-application.properties
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefaults(314)) - Using graphdb backend 'janus'
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefaults(325)) - Using storage backend 'hbase2'
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefaults(336)) - Using index backend 'elasticsearch'
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefaults(360)) - Setting atlas.graph.index.search.max-result-set-size = 150
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.db-cache = true
2021-09-03 09:06:05,568 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.db-cache-clean-wait = 20
2021-09-03 09:06:05,569 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.db-cache-size = 0.5
2021-09-03 09:06:05,569 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.tx-cache-size = 15000
2021-09-03 09:06:05,569 INFO [main] atlas.ApplicationProperties (ApplicationProperties.java:setDefault(372)) - Property (set to default) atlas.graph.cache.tx-dirty-size = 120
2021-09-03 09:06:05,583 INFO [main] kafka.KafkaNotification (KafkaNotification.java:<init>(115)) - ==> KafkaNotification()
2021-09-03 09:06:05,592 INFO [main] kafka.KafkaNotification (KafkaNotification.java:<init>(149)) - <== KafkaNotification()
2021-09-03 09:06:05,624 INFO [main] hook.AtlasHook (AtlasHook.java:<clinit>(141)) - Created Atlas Hook
2021-09-03 09:06:05,628 INFO [main] hook.HiveHook (HiveHook.java:<init>(177)) - 222==============================
2021-09-03 09:06:05,679 INFO [main] conf.HiveConf (HiveConf.java:findConfigFile(187)) - Found configuration file file:/home/usr_local/hive/conf/hive-site.xml

Start the metastore service as a background daemon:
nohup hive --service metastore >/usr/local/hive/metastore.log 2>&1 &
SQL test case
SET LINKIS.SUBMIT.USER=suyc;
SET LINKIS.TASK.NAME=ws01-pj01-flow01;

create database db0903;

create table db0903.person01(
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';

create table db0903.person02(
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';

insert overwrite table db0903.person02 select * from db0903.person01;

create table db0903.person03 as select * from db0903.person02;
Extending the Hive hook
Remote-debugging the Hive client
https://www.cnblogs.com/songchaolin/p/13084252.html
Start the Hive client listening for a remote debugger: hive --debug
Attach a remote debug session from local IDEA
Requirement
When collecting Hive metadata, we also need to attach the top-most business information, so that business metadata can be linked to technical metadata.
How the Hive driver-side hook works
Hook configuration added to hive-site.xml
<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
Source entry point in the Atlas project
addons/hive-bridge-shim/src/main/java/org/apache/atlas/hive/hook/HiveHook.java
Mechanism: Hive Hook
Hive hook interface:
org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
Atlas implementations:
addons/hive-bridge-shim/src/main/java/org/apache/atlas/hive/hook/HiveHook.java
addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveHook.java
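For orientation, the sketch below shows a minimal post-execution hook implementing this interface. It is our own illustration, not the Atlas implementation; reading the LINKIS.* session properties mirrors the custom log lines in the driver logs above, and the class name is made up.

import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Registered via hive.exec.post.hooks, alongside the Atlas HiveHook
public class BusinessInfoHook implements ExecuteWithHookContext {
    @Override
    public void run(HookContext hookContext) throws Exception {
        // The SQL text of the statement that just finished executing
        String queryStr = hookContext.getQueryPlan().getQueryStr();

        // Session-level properties set by the client (see the SET statements
        // in the SQL test case above); this is where business metadata can be
        // picked up and later attached to the entities sent to Atlas
        String submitUser = hookContext.getConf().get("LINKIS.SUBMIT.USER");
        String taskName   = hookContext.getConf().get("LINKIS.TASK.NAME");

        System.out.println("query=" + queryStr
                + ", LINKIS.SUBMIT.USER=" + submitUser
                + ", LINKIS.TASK.NAME=" + taskName);
    }
}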
How the HiveServer2-side hook works
How the Hive metastore-side hook works
Hook configuration added to hive-site.xml
<property>
    <name>hive.metastore.event.listeners</name>
    <value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>
Source entry point in the Atlas project
addons/hive-bridge-shim/src/main/java/org/apache/atlas/hive/hook/HiveMetastoreHook.java
Mechanism: Hive Listener
Hive abstract class:
org.apache.hadoop.hive.metastore.MetaStoreEventListener
Atlas implementation:
addons/hive-bridge/src/main/java/org/apache/atlas/hive/hook/HiveMetastoreHookImpl.java
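A minimal listener sketch of our own (not the Atlas implementation) showing the shape of this abstract class; the class name and log output are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.events.CreateTableEvent;
import org.apache.hadoop.hive.metastore.events.DropTableEvent;

// Registered via hive.metastore.event.listeners in hive-site.xml
public class DemoMetastoreListener extends MetaStoreEventListener {
    public DemoMetastoreListener(Configuration config) {
        super(config); // the metastore instantiates listeners with its Configuration
    }

    @Override
    public void onCreateTable(CreateTableEvent event) throws MetaException {
        System.out.println("table created: " + event.getTable().getTableName());
    }

    @Override
    public void onDropTable(DropTableEvent event) throws MetaException {
        System.out.println("table dropped: " + event.getTable().getTableName());
    }
}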
Concrete implementation
Spark 2.4.x integration testing
Known issues
Spark versions supported by the SAC hook: currently only Spark 2.4.x is supported, not Spark 3.x.
The lineage that the SAC hook collects is incomplete. Actual testing confirmed the following known limitations (design decisions):
SAC only supports the SQL/DataFrame API (in other words, SAC doesn't support RDDs).
All "inputs" and "outputs" of multiple queries are accumulated into a single "spark_process" entity when multiple queries run in a single Spark session.
SAC classifies table-related entities with two different kinds of models: Spark / Hive.
The SAC authors decided to skip sending create events for Hive tables managed by HMS, to avoid duplicating the events from the Atlas hook for Hive. For Hive entities, Atlas relies on the Atlas hook for Hive as the source of truth.
Spark DDL is not collected by the Spark hook; the Hive metastore hook must be enabled to capture it.
SAC
https://github.com/hortonworks-spark/spark-atlas-connector
Note: actual testing confirmed that SAC only supports Spark 2.4.x, not Spark 3.x (the Java classes are incompatible).
Building SAC
[root@hadoop03 spark-atlas-connector]# pwd
/home/atlas/spark-atlas-connector
[root@hadoop03 spark-atlas-connector]# git status
# On branch master
nothing to commit, working tree clean
[root@hadoop03 spark-atlas-connector]# mvn clean
[root@hadoop03 spark-atlas-connector]# mvn package -DskipTests
To also skip compiling the tests: mvn clean package -Dmaven.test.skip=true
......
[root@hadoop03 spark-atlas-connector]# ll spark-atlas-connector-assembly/target/
drwxr-xr-x 2 root root 28 Aug 5 18:10 antrun
drwxr-xr-x 2 root root 28 Aug 5 18:10 maven-archiver
-rw-r--r-- 1 root root 2803 Aug 5 18:10 original-spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar
drwxr-xr-x 4 root root 41 Aug 5 18:10 scala-2.11
-rw-r--r-- 1 root root 41679846 Aug 5 18:10 spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar
drwxr-xr-x 2 root root 6 Aug 5 18:10 tmp
Install a Spark gateway and test SAC
Atlas service deployment
The Atlas service (with the full ZK + Kafka + HBase + ES stack) is already deployed on the hadoop04 machine.
Base environment
[root@hadoop01 usr_local]# which java
/usr/jdk1.8.0_191/bin/java
[root@hadoop01 usr_local]# which hadoop
/usr/local/hadoop-3.2.1/bin/hadoop
[root@hadoop01 usr_local]# which hive
/usr/local/hive/bin/hive
[root@hadoop01 usr_local]# which spark
/usr/bin/which: no spark in (/usr/local/scala/bin:/usr/local/flink/bin:/usr/local/apache-maven-3.6.1/bin:/usr/jdk1.8.0_191/bin:/usr/jdk1.8.0_191/jre/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/hadoop-3.2.1/bin:/usr/local/hive/bin:/usr/local/spark/bin:/usr/local/hadoop-3.2.1/etc/hadoop:/root/bin)
[root@hadoop01 usr_local]# echo $SPARK_HOME
/usr/local/spark
Prepare the Atlas files
[root@hadoop01 ~]# ll /home/atlas_files/
-rw-r--r-- 1 root root 12332 Aug 6 13:41 atlas-application.properties
-rw-r--r-- 1 root root 41679846 Aug 6 13:41 spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar
Deploy the Spark gateway and prepare the config files
[root@hadoop01 usr_local]# pwd
/home/usr_local
[root@hadoop01 usr_local]# unzip spark-2.4.7.zip

[root@hadoop01 spark-2.4.7]# pwd
/home/usr_local/spark-2.4.7
[root@hadoop01 spark-2.4.7]# cp /usr/local/hive/conf/hive-site.xml ./conf/

[root@hadoop01 spark-2.4.7]# ll conf/ | grep -i "atlas\|hive"
-rw-r--r-- 1 root root 12332 Aug 6 13:35 atlas-application.properties
-rw-r--r-- 1 root root 2212 Aug 6 13:37 hive-site.xml

[root@hadoop01 spark-2.4.7]# cat conf/spark-env.sh | grep -iv "#"
export JAVA_HOME=/usr/jdk1.8.0_191/
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop-3.2.1/bin/hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop-3.2.1/conf
export HIVE_HOME=/usr/local/hive

[root@hadoop01 ~]# ln -s /home/usr_local/spark-2.4.7 /usr/local/spark2.4.7
[root@hadoop01 ~]# cd /usr/local/spark2.4.7/
[root@hadoop01 spark2.4.7]# pwd
/usr/local/spark2.4.7

----------- another Spark environment already exists; we need to test our own version
[root@hadoop01 spark2.4.7]# echo $SPARK_HOME
/usr/local/spark
[root@hadoop01 spark2.4.7]# export SPARK_HOME=/usr/local/spark2.4.7
[root@hadoop01 spark2.4.7]# echo $SPARK_HOME
/usr/local/spark2.4.7
Testing
----------- Start the Spark client without the hook
[root@hadoop01 spark2.4.7]# bin/spark-shell --master yarn
[root@hadoop01 spark2.4.7]# bin/spark-sql --master yarn

----------- Start the Spark client with the hook
bin/spark-sql --master yarn --executor-memory 1G --executor-cores 1 \
--files /home/atlas_files/atlas-application.properties \
--jars /home/atlas_files/spark-atlas-connector-assembly-0.1.0-SNAPSHOT.jar \
--conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker

----------- Ignored by the Spark hook
spark-sql> create database db_suyc;

----------- Ignored by the Spark hook
CREATE TABLE db_suyc.person22(
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';

----------- Captured by the Spark hook and sent to Atlas in real time
spark-sql> create table db_suyc.person33 as select * from db_suyc.person22;

----------- Captured by the Spark hook and sent to Atlas in real time
spark-sql> insert into default.demo_01 select * from default.demo_02;
Remote debugging
Problem
We want a dynamic, hands-on way to understand Atlas's internals and source code.
Option 1: a local debugging environment
https://my.oschina.net/u/4286379/blog/4329390
Atlas is not well suited to running on Windows, and my machine happens to run Windows 10, so a local debugging environment is hard to set up.
Option 2: remote debugging
Enable remote debug mode on the Atlas service
I changed a single line in bin/atlas_start.py:
#DEFAULT_JVM_OPTS="-Dlog4j.configuration=atlas-log4j.xml -Djava.net.preferIPv4Stack=true -server"
DEFAULT_JVM_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=9999 -Dlog4j.configuration=atlas-log4j.xml -Djava.net.preferIPv4Stack=true -server"
Remote debug from local IDEA
https://www.cnblogs.com/wy2325/p/5600232.html
Make sure the remote host's IP and port are filled in correctly.
Then set breakpoints just as in local debugging.
Debug points of interest
Atlas's public REST endpoints
Where to set breakpoints:
webapp/src/main/java/org/apache/atlas/web/rest
How to trigger:
send HTTP requests with Postman
How Atlas consumes hook messages from the ATLAS_HOOK Kafka topic
Where to set breakpoints:
webapp/src/main/java/org/apache/atlas/notification/NotificationHookConsumer.java
webapp/src/main/java/org/apache/atlas/notification/preprocessor/HivePreprocessor.java
How to trigger:
run a Hive SQL statement so the Hive hook pushes a message to the Kafka topic,
or run a Spark SQL statement so the Spark hook pushes a message to the Kafka topic
Hive hook execution logic
Remote-debugging the Hive client:
https://www.cnblogs.com/songchaolin/p/13084252.html
Remote-debugging HiveServer2 and the beeline client:
https://blog.csdn.net/merrily01/article/details/105725414/
Spark hook execution logic
Remote-debugging the Spark client:
https://blog.csdn.net/asfjgvajfghaklsbf/article/details/109671367
REST API summary and testing
Background
We can interact with Atlas through its public REST APIs to create, read, update, and delete metadata.
Where necessary, we can extend them to fit our own requirements.
References
Public REST API
Source:
webapp/src/main/java/org/apache/atlas/web/rest
Official API docs:
http://atlas.apache.org/api/v2/
Swagger UI:
http://atlas.apache.org/api/v2/ui/index.html#/
Other references:
https://www.jianshu.com/p/a37ae460986f
https://blog.csdn.net/wangpei1949/article/details/87891862
https://marcel-jan.eu/datablog/2019/09/03/the-atlas-rest-api-working-examples/
Public Java API
Source:
client/client-v2/src/main/java/org/apache/atlas/AtlasClientV2.java
Essentially a wrapper around the REST API.
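A small usage sketch of that wrapper, fetching an entity by its unique attribute; the URL, credentials, and qualifiedName value follow the curl examples below, and the class name is made up:

import java.util.Collections;

import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.AtlasServiceException;
import org.apache.atlas.model.instance.AtlasEntity.AtlasEntityWithExtInfo;

public class ClientLookupDemo {
    public static void main(String[] args) throws AtlasServiceException {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://hadoop04:21000"}, new String[]{"admin", "admin"});

        // Equivalent of GET /v2/entity/uniqueAttribute/type/hive_table?attr:qualifiedName=...
        AtlasEntityWithExtInfo table = client.getEntityByAttribute("hive_table",
                Collections.singletonMap("qualifiedName", "default.demo_02@primary"));

        System.out.println(table.getEntity().getGuid());
    }
}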
API summary and tests
AdminREST
Check the Atlas Metadata Server status: GET /admin/status
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/admin/status"

Check the Atlas version and description: GET /admin/version
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/admin/version"
DiscoveryREST
# List all Hive databases
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/search/basic?typeName=hive_db"

# List all Hive tables
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/search/basic?typeName=hive_table"

# List all Hive tables matching a keyword
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/search/basic?typeName=hive_table&query=ads_gmv_sum_day"

# List all Hive databases (v1 API)
http://hadoop04:21000/api/atlas/entities?type=hive_db

# List all Hive tables (v1 API)
http://hadoop04:21000/api/atlas/entities?type=hive_table
TypesREST
Retrieve all type definitions (full detail): GET /v2/types/typedefs
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/types/typedefs"

Retrieve all type definitions (headers only): GET /v2/types/typedefs/headers
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/types/typedefs/headers"
EntityREST
Find a table's GUID:
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/search/basic?query=gdyinfo_new&typeName=hive_table"

curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/entity/uniqueAttribute/type/hive_table?attr:qualifiedName=default.demo_02@primary"

Retrieve entities by GUID in bulk: GET /v2/entity/bulk
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/entity/bulk?minExtInfo=yes&guid=2dd4ca4c-9d33-4c19-bca3-f60e162debf2"

Get one entity's definition: GET /v2/entity/guid/{guid}
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/entity/guid/2dd4ca4c-9d33-4c19-bca3-f60e162debf2"

Get an entity's classifications (tags): GET /v2/entity/guid/{guid}/classifications
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/entity/guid/2dd4ca4c-9d33-4c19-bca3-f60e162debf2/classifications"

Get an entity by a unique attribute (v1 API):
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/entities?type={type_name}&property={unique_attribute_name}&value={unique_attribute_value}"

Update a single attribute of an entity:
PUT http://hadoop04:21000/api/atlas/v2/entity/guid/0e822d4c-a578-4b0a-b9e6-085096fbf92f?name=comment
"This is a test table (by suyc)"
LineageREST
Query an entity's lineage: GET /v2/lineage/{guid}
curl -s -u admin:admin "http://hadoop04:21000/api/atlas/v2/lineage/2dd4ca4c-9d33-4c19-bca3-f60e162debf2"

Generating lineage data
Creating a Process entity through the Atlas REST API produces lineage data.
For example, to link a MySQL table and a Hive table that Atlas already manages, first look up both tables' GUIDs, then construct the request and call:
Endpoint: http://{atlas_host}:21000/api/atlas/v2/entity/bulk
Request body:
{
  "entities": [
    {
      "typeName": "Process",
      "attributes": {
        "owner": "root",
        "createTime": "2020-05-07T10:32:21.0Z",
        "updateTime": "",
        "qualifiedName": "people@process@mysql://192.168.1.1:3306",
        "name": "peopleProcess",
        "description": "people Process",
        "comment": "test people Process",
        "contact_info": "jdbc",
        "type": "table",
        "inputs": [
          {
            "guid": "5a676b74-e058-4e81-bcf8-42d73f4c1729",
            "typeName": "rdbms_table"
          }
        ],
        "outputs": [
          {
            "guid": "2e7c70e1-5a8a-4430-859f-c46d267e33fd",
            "typeName": "hive_table"
          }
        ]
      }
    }
  ]
}
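The same call can also be made through AtlasClientV2; a sketch mirroring the JSON above, reusing the example's placeholder GUIDs:

import java.util.Arrays;

import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.AtlasServiceException;
import org.apache.atlas.model.instance.AtlasEntity;
import org.apache.atlas.model.instance.AtlasEntity.AtlasEntityWithExtInfo;
import org.apache.atlas.model.instance.AtlasObjectId;

public class LineageProcessDemo {
    public static void main(String[] args) throws AtlasServiceException {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://hadoop04:21000"}, new String[]{"admin", "admin"});

        AtlasEntity process = new AtlasEntity("Process");
        process.setAttribute("name", "peopleProcess");
        process.setAttribute("qualifiedName", "people@process@mysql://192.168.1.1:3306");

        // inputs/outputs reference existing entities by GUID, as in the JSON above
        process.setAttribute("inputs",
                Arrays.asList(new AtlasObjectId("5a676b74-e058-4e81-bcf8-42d73f4c1729", "rdbms_table")));
        process.setAttribute("outputs",
                Arrays.asList(new AtlasObjectId("2e7c70e1-5a8a-4430-859f-c46d267e33fd", "hive_table")));

        client.createEntity(new AtlasEntityWithExtInfo(process));
    }
}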
Integrating an RDBMS via the API
Create rdbms_instance, rdbms_db, rdbms_column, and rdbms_table entities:
https://www.codeleading.com/article/29371584292/

Method: POST
Path: http://hadoop04:21000/api/atlas/v2/entity
Auth: basic auth, admin/admin
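A hedged sketch of creating one such entity through the Java client. Note that a real rdbms_* integration also needs the model's relationship attributes (for example, linking a db to its instance); see the article above. The names and qualifiedName format here are purely illustrative:

import org.apache.atlas.AtlasClientV2;
import org.apache.atlas.AtlasServiceException;
import org.apache.atlas.model.instance.AtlasEntity;
import org.apache.atlas.model.instance.AtlasEntity.AtlasEntityWithExtInfo;

public class RdbmsEntityDemo {
    public static void main(String[] args) throws AtlasServiceException {
        AtlasClientV2 client = new AtlasClientV2(
                new String[]{"http://hadoop04:21000"}, new String[]{"admin", "admin"});

        // Minimal rdbms_db entity; a full integration also sets the model's
        // relationship attributes (instance, tables, ...) per the article above
        AtlasEntity db = new AtlasEntity("rdbms_db");
        db.setAttribute("name", "people_db");
        db.setAttribute("qualifiedName", "people_db@mysql://192.168.1.1:3306");

        client.createEntity(new AtlasEntityWithExtInfo(db));
    }
}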