join详细设计
<br><br><br><br>腾讯算法比赛<br>1 user_installedapps.csv 和 user_app_action.csv根据userID和appID进行join 得到(userID,appID,installTime的hive表temps1)<br><br> hiveContext.sql("select user_app_actions.userID,user_app_actions.installTime,user_app_actions.appID from user_app_actions join user_installedapps where user_app_actions.userID = user_installedapps.userID AND user_app_actions.appID=user_installedapps.appID”)<br><br>2 temps1表 根据userID和user.csv进行join 得到temps2表<br>(userID,appID,installTime,age,gender,education,marriageStatus,haveBaby,hometown,residence)<br><br><br>hiveContext.sql("select user.userID,user.age,user.gender,user.education,user.marriageStatus,user.haveBaby,user.hometown,user.residence,temps1.appID,temps1.installTime from user join temps1 where user.userID=temps1.userID”)<br><br>3 temps2表 根据 appID 和app_categories.csv对应的表进行join操作得到tems3表 <br><br>hiveContext.sql("select temps2.userID,temps2.age,temps2.gender,temps2.education,temps2.marriageStatus,temps2.haveBaby,temps2.hometown,temps2.residence,temps2.appID,temps2.installTime,app_categories.appCategory from app_categories join temps2 where temps2.appID=app_categories.appID”)<br><br>[userID,bigint,null]<br>[age,int,null]<br>[gender,int,null]<br>[education,int,null]<br>[marriageStatus,int,null]<br>[haveBaby,int,null]<br>[hometown,int,null]<br>[residence,int,null]<br>[appID,int,null]<br>[installTime,int,null]<br>[appCategory,int,null]<br><br>4 temps3 表根据userID和train.csv对应的表进行join 得到temps4表<br><br>hiveContext.sql("select temps3.userID,temps3.age,temps3.gender,temps3.education,temps3.marriageStatus,temps3.haveBaby,temps3.hometown,temps3.residence,temps3.appID,temps3.installTime,temps3.appCategory,train.label,train.clickTime,train.conversionTime,train.positionID,train.connectionType,train.telecomsOperator from train join temps3 where train.userID=temps3.userID")<br><br><br>5 temps4 表根据 positionID 和position.csv对应的表进行join 得到temps5<br><br>hiveContext.sql("select temps4.userID,temps4.age,temps4.gender,temps4.education,temps4.marriageStatus,temps4.haveBaby,temps4.hometown,temps4.residence,temps4.appID,temps4.installTime,temps4.appCategory,temps4.label,temps4.clickTime,temps4.conversionTime,temps4.positionID,temps4.connectionType,temps4.telecomsOperator,position.sitesetID,position.positionType from position join temps4 where temps4.positionID=position.positionID")<br><br><br>6 temps5 表根据 appID和ad.csv对应的表进行join操作<br><br><br>hiveContext.sql("select temps5.userID,temps5.age,temps5.gender,temps5.education,temps5.marriageStatus,temps5.haveBaby,temps5.hometown,temps5.residence,temps5.appID,temps5.installTime,temps5.appCategory,temps5.label,temps5.clickTime,temps5.conversionTime,temps5.positionID,temps5.connectionType,temps5.telecomsOperator,temps5.sitesetID,temps5.positionType,ad.creativeID,ad.adID,ad.camgaignID,ad.advertiserID,ad.appPlatform from ad join temps5 where temps5.appID=ad.appID")<br><br><br><br>结果表 lasttemps<br>[userID,bigint,null]<br>[age,int,null]<br>[gender,int,null]<br>[education,int,null]<br>[marriageStatus,int,null]<br>[haveBaby,int,null]<br>[hometown,int,null]<br>[residence,int,null]<br>[appID,int,null]<br>[installTime,int,null]<br>[appCategory,int,null]<br>[label,int,null]<br>[clickTime,int,null]<br>[conversionTime,int,null]<br>[positionID,int,null]<br>[connectionType,int,null]<br>[telecomsOperator,int,null]<br>[sitesetID,int,null]<br>[positionType,int,null]<br>[creativeID,int,null]<br>[adID,int,null]<br>[camgaignID,int,null]<br>[advertiserID,int,null]<br>[appPlatform,int,null]<br><br><br><br><br>7 预处理<br><br>训练集合进行数据过滤,降噪,降维,特征选取<br><br><br><br>8 模型训练<br><br><br>9 模型预测<br><br><br>10 模型评估<br><br><br>11 模型优化<br><br>12 提交结果
建总表
hiveContext.sql("create table newalltables(userID int,age int,gender int,education int,marriageStatus int,haveBaby int,hometown int,residence int,appID int,installTime int,appCategory int,label int,clickTime int,conversionTime int,positionID int,connectionType int,telecomsOperator int,sitesetID int,positionType int,creativeID int,adID int,camgaignID int,advertiserID int,appPlatform int) location '/usr/local/bigdata/tencentData/pre/newalltables'")
插入数据
scala> hiveContext.sql("insert into table newalltables select userID,age,gender,education,marriageStatus,haveBaby,hometown,residence,appID,installTime,appCategory,label,clickTime,conversionTime,positionID,connectionType,telecomsOperator,sitesetID,positionType,creativeID,adID,camgaignID,advertiserID,appPlatform from lasttemps")
test 表聚合
1 hiveContext.sql("select test.instanceID,test.userID,test.positionID,test.connectionType,test.clickTime,test.creativeID,test.telecomsOperator,user.age,user.gender,user.education,user.marriageStatus,user.haveBaby,user.hometown,user.residence from test join user where user.userID=test.userID") res12.createOrReplaceTempView("testtemps1")<br> 得到临时表 testtemps1
2 通过 testtemps1 临时表join ad表 :过程如下所示 (根据creativeID来join) 得到 testtemps3
hiveContext.sql("select testtemps1.instanceID,testtemps1.userID,testtemps1.positionID,testtemps1.connectionType,testtemps1.clickTime,testtemps1.creativeID,testtemps1.telecomsOperator,testtemps1.age,testtemps1.gender,testtemps1.education,testtemps1.marriageStatus,testtemps1.haveBaby,testtemps1.hometown,testtemps1.residence,ad.adID,ad.camgaignID,ad.advertiserID,ad.appID,ad.appPlatform from testtemps1 join ad where ad.creativeID=testtemps1.creativeID")
join (根据positionID) join position表 得到 testtemp4
hiveContext.sql("select testtemps3.instanceID,testtemps3.userID,testtemps3.positionID,testtemps3.connectionType,testtemps3.clickTime,testtemps3.creativeID,testtemps3.telecomsOperator,testtemps3.age,testtemps3.gender,testtemps3.education,testtemps3.marriageStatus,testtemps3.haveBaby,testtemps3.hometown,testtemps3.residence,testtemps3.adID,testtemps3.camgaignID,testtemps3.advertiserID,testtemps3.appID,testtemps3.appPlatform,position.sitesetID,position.positionType from testtemps3 join position where position.positionID=testtemps3.positionID")
testtemp4 join app_categories(根据appID join)
hiveContext.sql("select testtemps4.instanceID,testtemps4.userID,testtemps4.positionID,testtemps4.connectionType,testtemps4.clickTime,testtemps4.creativeID,testtemps4.telecomsOperator,testtemps4.age,testtemps4.gender,testtemps4.education,testtemps4.marriageStatus,testtemps4.haveBaby,testtemps4.hometown,testtemps4.residence,testtemps4.adID,testtemps4.camgaignID,testtemps4.advertiserID,testtemps4.appID,testtemps4.appPlatform,testtemps4.sitesetID,testtemps4.positionType,app_categories.appCategory from testtemps4 join app_categories where app_categories.appID=testtemps4.appID")
hiveContext.sql("create table lasttest2(instanceID int,userID int,positionID int,connectionType int,clickTime int,creativeID int,telecomsOperator int,age int,gender int,education int,marriageStatus int,haveBaby int,hometown int,residence int,adID int,camgaignID int,advertiserID int,appID int,appPlatform int,sitesetID int,positionType int,appCategory int) location '/usr/local'")