dataprocess方法介绍.html (dataprocess method reference)
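Purpose:
Call this method to remove invalid records.
Method signature:
void formatRec(String spStr, int fdSum, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields (records that do not have this many fields are removed)
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. Records with fewer than 8 columns are invalid; formatRec filters them out and keeps only the valid records.
Program listing: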
import com.dksou.fitting.dataprocess.service.FormatRecService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class FormatRecClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    FormatRecService.Client client;
    TTransport tTransport;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol formatRecService = new TMultiplexedProtocol(tProtocol, "FormatRecService");
        client = new FormatRecService.Client(formatRecService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void formatRec() throws TException {
        String hostIp = "192.168.60.129";
        String hostPort = "10000";
        String hostName = "root";
        String hostPassword = "123456";
        // Field separator
        String spStr = ",";
        int fdSum = 8;
        String srcDirName = "/test/in/";
        String dstDirName = "/test/out";
        System.out.print("---------------------");
        // Clean the data: keep only the records with the expected number of fields
        client.formatRec(spStr, fdSum, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
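The field-count rule that formatRec applies can be pictured with a small local sketch. This is not the service's implementation: `FieldCountFilter` and `keepValid` are hypothetical names, and the sketch assumes that a record is valid only when it splits into exactly `fdSum` fields and that trailing empty fields count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the dataprocess API.
class FieldCountFilter {
    /** Keeps only the records that split into exactly fdSum fields. */
    static List<String> keepValid(List<String> records, String spStr, int fdSum) {
        List<String> valid = new ArrayList<>();
        for (String record : records) {
            // Pattern.quote treats the separator literally; limit -1 keeps
            // trailing empty fields, so "a,b," counts as three fields.
            String[] fields = record.split(Pattern.quote(spStr), -1);
            if (fields.length == fdSum) {
                valid.add(record);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "1,2,Tom,M,Math,90,Bob,555-0100", // 8 fields: kept
                "1,2,Ann,F,Math");                 // 5 fields: dropped
        System.out.println(keepValid(records, ",", 8));
    }
}
```

Running the sketch on a sample of the input before calling the service is one way to estimate how many records the cleaning step would discard.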
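Purpose:
Call this method to filter records by keyword, keeping only the ones you want.
Method signature:
void formatField(String spStr, int fdSum, String fdNum, String regExStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
fdNum: number of the field to check against the pattern (0 checks all fields); one or more numbers, separated by commas (1,2,3...)
regExStr: records whose field contains this string are removed (a,b,c); the entries correspond to the field numbers, and with several fields a record is removed only if every field matches its condition
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To view the scores of every grade except first grade, use formatField to filter out the first-grade records.
Program listing: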
import com.dksou.fitting.dataprocess.service.FormatFieldService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class FormatFieldClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    FormatFieldService.Client client;
    TTransport tTransport;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol formatFieldService = new TMultiplexedProtocol(tProtocol, "FormatFieldService");
        client = new FormatFieldService.Client(formatFieldService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void formatField() throws TException {
        String hostIp = "192.168.60.129";
        String hostName = "root";
        String hostPassword = "123456";
        int fdSum = 8;
        // Field separator
        String spStr = ",";
        String srcDirName = "/test/in/";
        String dstDirName = "/test/out";
        String hostPort = "10000";
        // Check field 1 (grade) for "1" so that first-grade records are removed
        String fdNum = "1";
        String regExStr = "1";
        System.out.print("---------------------");
        // Remove the records whose checked field contains regExStr
        client.formatField(spStr, fdSum, fdNum, regExStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
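The removal rule described for fdNum and regExStr (a record is dropped only when every checked field matches its paired keyword) can be sketched locally as follows. `KeywordFilter` and `dropMatching` are hypothetical names. The sketch treats each keyword as a literal substring, which is an assumption: the service may interpret regExStr as a regular expression, and the fdNum value 0 (check all fields) is not modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the dataprocess API.
class KeywordFilter {
    /**
     * Drops a record when every checked field contains its paired keyword.
     * fdNums holds 1-based field numbers; keywords[i] is paired with fdNums[i].
     */
    static List<String> dropMatching(List<String> records, String spStr,
                                     int[] fdNums, String[] keywords) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            String[] fields = record.split(Pattern.quote(spStr), -1);
            boolean allMatch = true;
            for (int i = 0; i < fdNums.length; i++) {
                int idx = fdNums[i] - 1;
                // A missing field or a field without the keyword means "no match".
                if (idx < 0 || idx >= fields.length || !fields[idx].contains(keywords[i])) {
                    allMatch = false;
                    break;
                }
            }
            if (!allMatch) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("1,a,Tom", "2,a,Ann", "1,b,Joe");
        // Remove the records whose first field contains "1".
        System.out.println(dropMatching(records, ",", new int[] {1}, new String[] {"1"}));
    }
}
```

Note that substring matching on "1" would also remove grades such as "10" or "11"; whether that matters depends on how grades are encoded in the data.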
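Purpose:
Call this method to keep only a chosen subset of the fields.
Method signature:
void selectField(String spStr, int fdSum, String fdNum, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
fdNum: numbers of the fields to keep (fields that are not listed are dropped), given as comma-separated numbers (1,2,3...)
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To view each student's name together with the parent name and contact details, use selectField to keep only those columns.
Program listing: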
import com.dksou.fitting.dataprocess.service.SelectFieldService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class SelectFieldClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    SelectFieldService.Client client;
    TTransport tTransport;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol selectFieldService = new TMultiplexedProtocol(tProtocol, "SelectFieldService");
        client = new SelectFieldService.Client(selectFieldService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void selectField() throws TException {
        String hostIp = "192.168.60.129";
        String hostName = "root";
        String hostPassword = "123456";
        int fdSum = 8;
        // Field separator
        String spStr = ",";
        String srcDirName = "/test/in/";
        String dstDirName = "/test/out/";
        String hostPort = "10000";
        // Fields to keep: name (3), parent name (7), contact details (8)
        String fdNum = "3,7,8";
        System.out.print("---------------------");
        // Keep only the listed fields of each record
        client.selectField(spStr, fdSum, fdNum, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
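What selectField does to a single record amounts to a column projection. The sketch below shows that idea locally under stated assumptions: `FieldProjection` and `project` are hypothetical names, field numbers are taken to be 1-based, and the kept fields are joined with the same separator.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the dataprocess API.
class FieldProjection {
    /** Returns the fields named in fdNum (comma-separated, 1-based), in that order. */
    static String project(String record, String spStr, String fdNum) {
        String[] fields = record.split(Pattern.quote(spStr), -1);
        StringJoiner out = new StringJoiner(spStr);
        for (String n : fdNum.split(",")) {
            out.add(fields[Integer.parseInt(n.trim()) - 1]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Keep name (3), parent name (7) and contact details (8).
        System.out.println(project("1,2,Tom,M,Math,90,Bob,555-0100", ",", "3,7,8"));
    }
}
```

The sketch throws an exception for a field number outside the record; how the service treats such records is not stated in this document.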
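Purpose:
Call this method to select the records that match a condition.
Method signature:
void selectRec(String spStr, int fdSum, String whereStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
whereStr: comparison condition, for example f1 >= 2 and (f2=3 or f3=4), where f1 is the first field
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To view the students whose Chinese score is below 60, use selectRec with a matching condition.
Program listing: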
import com.dksou.fitting.dataprocess.service.SelectRecService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class SelectRecClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    SelectRecService.Client client;
    TTransport tTransport;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol selectRecService = new TMultiplexedProtocol(tProtocol, "SelectRecService");
        client = new SelectRecService.Client(selectRecService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void selectRec() throws TException {
        String hostIp = "192.168.60.129";
        String hostName = "root";
        String hostPassword = "123456";
        int fdSum = 8;
        String spStr = ",";
        // '语文' is the subject value "Chinese" as it is stored in the data
        String whereStr = "f5='语文' and f6 < '60'";
        String srcDirName = "/test/in";
        String dstDirName = "/test/out";
        String hostPort = "10000";
        System.out.print("---------------------");
        client.selectRec(spStr, fdSum, whereStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
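Purpose:
This method removes duplicates, leaving the distinct records or the distinct values of the chosen fields.
Method signature:
void dedup(String spStr, String fdNum, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdNum: fields to deduplicate on; 0 means the whole record; format: 0 or comma-separated numbers (1,2,3...)
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To get the distinct subjects in the student data, use dedup.
Program listing: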
import com.dksou.fitting.dataprocess.service.DedupeService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DedupeClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    DedupeService.Client client;
    TTransport tTransport;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dedupeService = new TMultiplexedProtocol(tProtocol, "DedupeService");
        client = new DedupeService.Client(dedupeService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void dedup() throws TException {
        String spStr = ",";
        // Deduplicate on field 5 (subject)
        String fdNum = "5";
        String srcDirName = "/test/in";
        String dstDirName = "/test/out";
        String hostIp = "192.168.60.129";
        String hostPort = "10000";
        String hostName = "root";
        String hostPassword = "123456";
        System.out.print("---------------------");
        client.dedup(spStr, fdNum, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
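The role of fdNum in dedup (0 for the whole record, otherwise a list of field numbers) can be illustrated with a local sketch. `Dedup`, `distinct` and `key` are hypothetical names. The sketch assumes the result is the list of distinct key values in first-seen order; this document does not say whether the service writes the distinct values or one full record per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the dataprocess API.
class Dedup {
    /** Returns the distinct keys in first-seen order. */
    static List<String> distinct(List<String> records, String spStr, String fdNum) {
        Set<String> seen = new LinkedHashSet<>();
        for (String record : records) {
            seen.add(key(record, spStr, fdNum));
        }
        return new ArrayList<>(seen);
    }

    /** Builds the deduplication key: the whole record for "0", else the listed fields. */
    static String key(String record, String spStr, String fdNum) {
        if (fdNum.trim().equals("0")) {
            return record;
        }
        String[] fields = record.split(Pattern.quote(spStr), -1);
        StringJoiner k = new StringJoiner(spStr);
        for (String n : fdNum.split(",")) {
            k.add(fields[Integer.parseInt(n.trim()) - 1]);
        }
        return k.toString();
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("1,Math", "2,Math", "1,Art", "1,Math");
        System.out.println(distinct(records, ",", "2")); // distinct subjects
        System.out.println(distinct(records, ",", "0")); // distinct whole records
    }
}
```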
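Purpose:
This method computes the maximum, minimum, sum, or average of a field.
Method signature:
void count(String fun, int fdSum, String spStr, int fdNum, String dirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
The computed result.
Parameters:
fun: function to apply: avg, min, max, or sum
fdSum: number of fields
spStr: field separator
fdNum: number of the field to compute on
dirName: directory name
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To compute the average of all scores in the student data, use count with the avg function.
Program listing: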
import com.dksou.fitting.dataprocess.service.DataStaticService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataStaticClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    TTransport tTransport;
    DataStaticService.Client client;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dataStaticService = new TMultiplexedProtocol(tProtocol, "DataStaticService");
        client = new DataStaticService.Client(dataStaticService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void count() throws TException {
        String hostIp = "192.168.60.129";
        String hostName = "root";
        String hostPassword = "123456";
        // Function to apply:
        //   avg: average
        //   min: minimum
        //   max: maximum
        //   sum: total
        String fun = "avg";
        int fdSum = 8;
        String spStr = ",";
        // Field 6 is the score
        int fdNum = 6;
        String dirName = "/test/in";
        String hostPort = "10000";
        System.out.print("---------------------");
        client.count(fun, fdSum, spStr, fdNum, dirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
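The four values accepted by fun map onto ordinary aggregate functions over one numeric column. A minimal local sketch, assuming the field always parses as a number; `FieldAggregate` and `aggregate` are hypothetical names, not part of the service:

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the dataprocess API.
class FieldAggregate {
    /** Applies avg, min, max or sum to the 1-based field fdNum of every record. */
    static double aggregate(String fun, List<String> records, String spStr, int fdNum) {
        DoubleSummaryStatistics stats = records.stream()
                .map(r -> r.split(Pattern.quote(spStr), -1)[fdNum - 1])
                .mapToDouble(Double::parseDouble)
                .summaryStatistics();
        switch (fun) {
            case "avg": return stats.getAverage();
            case "min": return stats.getMin();
            case "max": return stats.getMax();
            case "sum": return stats.getSum();
            default: throw new IllegalArgumentException("unknown function: " + fun);
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("Tom,90", "Ann,70", "Joe,80");
        System.out.println(aggregate("avg", records, ",", 2)); // prints 80.0
    }
}
```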
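Purpose:
This method counts the records in which a field satisfies a condition.
Method signature:
void countRecord(String fun, int fdSum, String spStr, int fdNum, String compStr, String whereStr, String dirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
fun: function to apply: count
fdSum: number of fields
spStr: field separator
fdNum: number of the field to test
compStr: comparison operator: >, <, >=, <=, =, or !=; usage: "'>='"
whereStr: comparison condition
dirName: directory name
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To count how many students are in first grade, use countRecord.
Program listing: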
import com.dksou.fitting.dataprocess.service.DataStaticService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataStaticClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    TTransport tTransport;
    DataStaticService.Client client;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dataStaticService = new TMultiplexedProtocol(tProtocol, "DataStaticService");
        client = new DataStaticService.Client(dataStaticService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void countRecord() throws TException {
        // Function to apply: count
        String fun = "count";
        // The condition is built from: field number, comparison operator, value
        // Field number
        int fdNum = 1;
        // Comparison operator
        String compStr = "=";
        // Condition
        String whereStr = "1";
        String dirName = "/test/in/";
        String hostIp = "192.168.60.129";
        String hostName = "root";
        String hostPassword = "123456";
        int fdSum = 8;
        String spStr = ",";
        String hostPort = "10000";
        System.out.print("---------------------");
        // Count the records whose field satisfies the condition; the result is printed to the console
        client.countRecord(fun, fdSum, spStr, fdNum, compStr, whereStr, dirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
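The way fdNum, compStr and whereStr combine into one test per record can be sketched locally. `ConditionalCount`, `countRecord` and `matches` below are hypothetical names, and the sketch assumes both sides of the comparison are numeric; how the service compares non-numeric values is not stated in this document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the dataprocess API.
class ConditionalCount {
    /** Counts the records whose 1-based field fdNum satisfies "field compStr whereStr". */
    static long countRecord(List<String> records, String spStr, int fdNum,
                            String compStr, String whereStr) {
        long n = 0;
        for (String record : records) {
            String field = record.split(Pattern.quote(spStr), -1)[fdNum - 1];
            if (matches(field, compStr, whereStr)) {
                n++;
            }
        }
        return n;
    }

    /** Compares the two values numerically using one of the six supported operators. */
    static boolean matches(String field, String compStr, String whereStr) {
        int c = Double.compare(Double.parseDouble(field), Double.parseDouble(whereStr));
        switch (compStr) {
            case ">":  return c > 0;
            case "<":  return c < 0;
            case ">=": return c >= 0;
            case "<=": return c <= 0;
            case "=":  return c == 0;
            case "!=": return c != 0;
            default: throw new IllegalArgumentException("unknown operator: " + compStr);
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("1,Tom", "1,Ann", "2,Joe");
        // Number of records whose first field equals 1.
        System.out.println(countRecord(records, ",", 1, "=", "1")); // prints 2
    }
}
```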
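Purpose:
This method filters data by a condition or computes grouped statistics.
Method signature:
void analyse(String spStr, int fdSum, String whereStr, String groupStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
whereStr: filter condition, for example "\"f1='T100'\""; write 1=1 if there is none
groupStr: grouping condition, for example "f1"; write 1 if there is none
srcDirName: directory that holds the input files
dstDirName: directory that holds the resulting data
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. (1) Count the boys and the girls in the student data. (2) Count the boys and the girls among the first-grade students.
Program listing: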
import com.dksou.fitting.dataprocess.service.DataAnalysisService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataAnalysisClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    TTransport tTransport;
    DataAnalysisService.Client client;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dataAnalysisService = new TMultiplexedProtocol(tProtocol, "DataAnalysisService");
        client = new DataAnalysisService.Client(dataAnalysisService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void analyse() throws TException {
        String spStr = "','";
        int fdSum = 8;
        // Case (2): restrict to first grade (f1) and group by gender (f4)
        String whereStr = "\"f1='1'\"";
        String groupStr = "f4";
        String srcDirName = "/test";
        String dstDirName = "/out";
        String hostIp = "192.168.60.129";
        String hostPort = "10000";
        String hostName = "root";
        String hostPassword = "123456";
        System.out.print("---------------------");
        client.analyse(spStr, fdSum, whereStr, groupStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
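The "filter, then group and count" pattern behind case (2) can be sketched locally. `GroupCount` and `countByField` are hypothetical names, and the sketch hard-codes an equality filter on one field instead of parsing a whereStr expression; it also assumes the grouped statistic is a row count.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the dataprocess API.
class GroupCount {
    /**
     * Counts records per value of groupField, considering only the records
     * whose filterField equals filterValue. Field numbers are 1-based.
     */
    static Map<String, Long> countByField(List<String> records, String spStr,
                                          int filterField, String filterValue,
                                          int groupField) {
        Map<String, Long> counts = new TreeMap<>();
        for (String record : records) {
            String[] f = record.split(Pattern.quote(spStr), -1);
            if (!f[filterField - 1].equals(filterValue)) {
                continue; // record does not pass the filter
            }
            counts.merge(f[groupField - 1], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "1,2,Tom,M", "1,2,Ann,F", "2,1,Joe,M", "1,3,Sam,M");
        // First-grade students (field 1 = "1") grouped by gender (field 4).
        System.out.println(countByField(records, ",", 1, "1", 4)); // prints {F=1, M=2}
    }
}
```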
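Purpose:
This method analyzes how often two items appear together.
Method signature:
void apriori2(String spStr, int fdSum, String pNum, String oNum, String whereStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
pNum: field that holds the items to analyze
oNum: field that holds the order number (or similar identifier)
whereStr: filter condition, for example "\"f1='T100'\""; write 1=1 if there is none
srcDirName: directory that holds the input files
dstDirName: directory that holds the resulting data
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
Given product order data, analyze how often two products are bought together. f1 is the order-number field and f2 is the product field, as follows:
Order no.  Product
1          milk
1          bread
1          beer
2          milk
2          juice
Program listing: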
import com.dksou.fitting.dataprocess.service.DataAnalysisService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataAnalysisClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    TTransport tTransport;
    DataAnalysisService.Client client;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dataAnalysisService = new TMultiplexedProtocol(tProtocol, "DataAnalysisService");
        client = new DataAnalysisService.Client(dataAnalysisService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void apriori2() throws TException {
        String spStr = ",";
        int fdSum = 2;
        // f2 holds the product, f1 holds the order number
        String pNum = "f2";
        String oNum = "f1";
        String whereStr = "1=1";
        String srcDirName = "/test/into";
        String dstDirName = "/test/out";
        String hostIp = "192.168.60.129";
        String hostPort = "10000";
        String hostName = "root";
        String hostPassword = "123456";
        System.out.print("---------------------");
        client.apriori2(spStr, fdSum, pNum, oNum, whereStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
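Counting how often two items share an order can be sketched locally as below. `PairCooccurrence` and `pairCounts` are hypothetical names. The sketch takes the field positions as 1-based integers rather than the "f1"/"f2" strings the service expects, and it reports raw co-occurrence counts; whether the service normalizes these into a frequency or probability is not stated in this document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the dataprocess API.
class PairCooccurrence {
    /** Counts, for every unordered item pair, the number of orders containing both items. */
    static Map<String, Integer> pairCounts(List<String> records, String spStr,
                                           int oNum, int pNum) {
        // Order number -> distinct items of that order, sorted so each pair has one spelling.
        Map<String, TreeSet<String>> baskets = new LinkedHashMap<>();
        for (String record : records) {
            String[] f = record.split(Pattern.quote(spStr), -1);
            baskets.computeIfAbsent(f[oNum - 1], k -> new TreeSet<>()).add(f[pNum - 1]);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (TreeSet<String> basket : baskets.values()) {
            List<String> items = new ArrayList<>(basket);
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "1,milk", "1,bread", "1,beer", "2,milk", "2,juice", "3,milk", "3,bread");
        // bread and milk appear together in orders 1 and 3.
        System.out.println(pairCounts(records, ",", 1, 2).get("bread+milk")); // prints 2
    }
}
```

Enumerating all pairs per order is quadratic in the basket size, which is fine for small orders; a full Apriori implementation would first prune items below a minimum support.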
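Purpose:
This method analyzes how often three items appear together.
Method signature:
void apriori3(String spStr, int fdSum, String pNum, String oNum, String whereStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
pNum: field that holds the items to analyze
oNum: field that holds the order number (or similar identifier)
whereStr: filter condition, for example "\"f1='T100'\""; write 1=1 if there is none
srcDirName: directory that holds the input files
dstDirName: directory that holds the resulting data
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
Example
Given product order data, analyze how often three products are bought together. f1 is the order-number field and f2 is the product field.
Program listing: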
import com.dksou.fitting.dataprocess.service.DataAnalysisService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataAnalysisClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    TTransport tTransport;
    DataAnalysisService.Client client;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dataAnalysisService = new TMultiplexedProtocol(tProtocol, "DataAnalysisService");
        client = new DataAnalysisService.Client(dataAnalysisService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void apriori3() throws TException {
        String spStr = ",";
        int fdSum = 2;
        // f2 holds the product, f1 holds the order number
        String pNum = "f2";
        String oNum = "f1";
        String whereStr = "1=1";
        String srcDirName = "/test/into";
        String dstDirName = "/test/out";
        String hostIp = "192.168.60.129";
        String hostPort = "10000";
        String hostName = "root";
        String hostPassword = "123456";
        System.out.print("---------------------");
        client.apriori3(spStr, fdSum, pNum, oNum, whereStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword);
        System.out.print("done");
    }
}
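Purpose:
Call this method to remove invalid records (Kerberos-enabled variant of formatRec).
Method signature:
void formatRecKerberos(String spStr, int fdSum, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path, String keytabPath, String principalPath)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields (records that do not have this many fields are removed)
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
user: login user name of the service principal
krb5Path: path to krb5.conf
keytabPath: path to hive.keytab
principalPath: principal of the Hive service
Example: principal=hive/[email protected]
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. Records with fewer than 8 columns are invalid; formatRecKerberos filters them out and keeps only the valid records.
Program listing: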
import com.dksou.fitting.dataprocess.service.DataCleanKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataCleanKerberosClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    TTransport tTransport;
    DataCleanKerberosService.Client client;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dataCleanKerberosService = new TMultiplexedProtocol(tProtocol, "DataCleanKerberosService");
        client = new DataCleanKerberosService.Client(dataCleanKerberosService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void formatRecKerberos() throws TException {
        String hostIp = "192.168.60.129";
        String hostPort = "10000";
        String hostName = "root";
        String hostPassword = "123456";
        // Field separator
        String spStr = ",";
        int fdSum = 8;
        String srcDirName = "/test/in/";
        String dstDirName = "/test/out";
        String user = "hive/[email protected]";
        String keytabPath = "/home/hive.keytab";
        String krb5Path = "/etc/krb5.conf";
        String principalPath = "principal=hive/[email protected]";
        System.out.print("---------------------");
        // Clean the data: keep only the records with the expected number of fields
        client.formatRecKerberos(spStr, fdSum, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword, user, krb5Path, keytabPath, principalPath);
        System.out.print("done");
    }
}
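Purpose:
Call this method to filter records by keyword, keeping only the ones you want (Kerberos-enabled variant of formatField).
Method signature:
void formatFieldKerberos(String spStr, int fdSum, String fdNum, String regExStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path, String keytabPath, String principalPath)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
fdNum: number of the field to check against the pattern (0 checks all fields); one or more numbers, separated by commas (1,2,3...)
regExStr: records whose field contains this string are removed (a,b,c); the entries correspond to the field numbers, and with several fields a record is removed only if every field matches its condition
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
user: login user name of the service principal
krb5Path: path to krb5.conf
keytabPath: path to hive.keytab
principalPath: principal of the Hive service
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To view the scores of every grade except first grade, use formatFieldKerberos to filter out the first-grade records.
Program listing: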
import com.dksou.fitting.dataprocess.service.DataCleanKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataCleanKerberosClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    TTransport tTransport;
    DataCleanKerberosService.Client client;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dataCleanKerberosService = new TMultiplexedProtocol(tProtocol, "DataCleanKerberosService");
        client = new DataCleanKerberosService.Client(dataCleanKerberosService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void formatFieldKerberos() throws TException {
        String hostIp = "192.168.60.129";
        String hostName = "root";
        String hostPassword = "123456";
        int fdSum = 8;
        // Field separator
        String spStr = ",";
        String srcDirName = "/test/in/";
        String dstDirName = "/test/out/";
        String hostPort = "10000";
        String user = "hive/[email protected]";
        String krb5Path = "/etc/krb5.conf";
        String keytabPath = "/home/hive.keytab";
        String principalPath = "principal=hive/[email protected]";
        // Check field 1 (grade) for "1" so that first-grade records are removed
        String fdNum = "1";
        String regExStr = "1";
        System.out.print("---------------------");
        // Remove the records whose checked field contains regExStr
        client.formatFieldKerberos(spStr, fdSum, fdNum, regExStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword, user, krb5Path, keytabPath, principalPath);
        System.out.print("done");
    }
}
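Purpose:
Call this method to keep only a chosen subset of the fields (Kerberos-enabled variant of selectField).
Method signature:
void selectFieldKerberos(String spStr, int fdSum, String fdNum, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path, String keytabPath, String principalPath)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
fdNum: numbers of the fields to keep (fields that are not listed are dropped), given as comma-separated numbers (1,2,3...)
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
user: login user name of the service principal
krb5Path: path to krb5.conf
keytabPath: path to hive.keytab
principalPath: principal of the Hive service
Example: principal=hive/[email protected]
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To view each student's name together with the parent name and contact details, use selectFieldKerberos to keep only those columns.
Program listing: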
import com.dksou.fitting.dataprocess.service.DataCleanKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataCleanKerberosClient {
    public static final String SERVER_IP = "localhost";
    public static final int SERVER_PORT = 9872;
    // Connections time out easily; increase this value if needed.
    public static final int TIMEOUT = 300000;
    TTransport tTransport;
    DataCleanKerberosService.Client client;

    @Before
    public void Before() throws TTransportException {
        tTransport = new TSocket(SERVER_IP, SERVER_PORT);
        TProtocol tProtocol = new TBinaryProtocol(tTransport);
        TMultiplexedProtocol dataCleanKerberosService = new TMultiplexedProtocol(tProtocol, "DataCleanKerberosService");
        client = new DataCleanKerberosService.Client(dataCleanKerberosService);
        tTransport.open();
    }

    @After
    public void close() {
        tTransport.close();
    }

    @Test
    public void selectFieldKerberos() throws TException {
        String hostIp = "192.168.60.129";
        String hostName = "root";
        String hostPassword = "123456";
        int fdSum = 8;
        // Field separator
        String spStr = ",";
        String srcDirName = "/test/in/";
        String dstDirName = "/test/out/";
        String hostPort = "10000";
        // Fields to keep: name (3), parent name (7), contact details (8)
        String fdNum = "3,7,8";
        String user = "hive/[email protected]";
        String keytabPath = "/home/hive.keytab";
        String krb5Path = "/etc/krb5.conf";
        String principalPath = "principal=hive/[email protected]";
        System.out.print("---------------------");
        // Keep only the listed fields of each record
        client.selectFieldKerberos(spStr, fdSum, fdNum, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword, user, krb5Path, keytabPath, principalPath);
        System.out.print("done");
    }
}
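Purpose:
Call this method to select the records that match a condition (Kerberos-enabled variant of selectRec).
Method signature:
void selectRecKerberos(String spStr, int fdSum, String whereStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path, String keytabPath, String principalPath)
Returns:
Empty: success; non-empty: an error message.
Parameters:
spStr: field separator
fdSum: number of fields
whereStr: comparison condition, for example f1 >= 2 and (f2=3 or f3=4), where f1 is the first field
srcDirName: source directory
dstDirName: output directory; it is overwritten if it already exists
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port, 10000 by default
hostName: user name for the Hive connection, for example root
hostPassword: password for the Hive user
user: login user name of the service principal
krb5Path: path to krb5.conf
keytabPath: path to hive.keytab
principalPath: principal of the Hive service
Example: principal=hive/[email protected]
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact details. To view the students whose Chinese score is below 60, use selectRecKerberos with a matching condition.
Program listing: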
import com.dksou.fitting.dataprocess.service.DataCleanKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataCleanKerberosClient {
public static final String SERVER_IP = "localhost";
public static final int SERVER_PORT = 9872;
public static final int TIMEOUT = 300000;//容易连接超时 要增大时间
TTransport tTransport;
DataCleanKerberosService.Client client;
@Before
public void Before() throws TTransportException {
tTransport=new TSocket( SERVER_IP,SERVER_PORT );
TProtocol tProtocol = new TBinaryProtocol(tTransport);
TMultiplexedProtocol dataCleanKerberosService = new TMultiplexedProtocol(tProtocol, "DataCleanKerberosService");
client = new DataCleanKerberosService.Client(dataCleanKerberosService);
tTransport.open();
}
@After
public void close(){
tTransport.close();
}
@Test
public void selectRecKerberos() throws TException {
String hostIp = "192.168.60.129";
String hostName = "root";
String hostPassword = "123456";
int fdSum = 8;
String spStr =",";
String whereStr = "f5='语文' and f6 < '60'";
String srcDirName ="/test/in";
String dstDirName ="/test/out";
String hostPort ="10000";
String user = "hive/[email protected]";
String keytabPath = "/home/hive.keytab";
String krb5Path = "/etc/krb5.conf";
String principalPath = "principal=hive/[email protected]";
System.out.print("---------------------");
client.selectRecKerberos(spStr, fdSum, whereStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword, user, krb5Path, keytabPath, principalPath);
System.out.print("Done");
}
}
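To make the meaning of fdSum and whereStr concrete, the following is a minimal local sketch of the selection described above: it keeps lines that split into exactly fdSum fields and satisfy a condition on f5 and f6. It is not the service's implementation (the real call runs against Hive); SelectRecSketch and the sample records are hypothetical, and the score is compared numerically here, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical local illustration of "field count + condition" selection. */
final class SelectRecSketch {
    private SelectRecSketch() {}

    /** Keeps lines that split into exactly fdSum fields and satisfy the condition. */
    static List<String> select(List<String> lines, String spStr, int fdSum,
                               Predicate<String[]> where) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Quote the separator so it is not read as a regex; -1 keeps trailing empty fields.
            String[] f = line.split(Pattern.quote(spStr), -1);
            if (f.length == fdSum && where.test(f)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "1,2,Li Lei,M,Chinese,55,Li Ming,13800000000",
                "1,2,Han Mei,F,Chinese,82,Han Gang,13900000000",
                "1,2,Li Lei,M,Math,40,Li Ming,13800000000",
                "1,2,broken record");
        // fN is 1-based, so f5 is index 4 and f6 is index 5.
        Predicate<String[]> where =
                f -> f[4].equals("Chinese") && Double.parseDouble(f[5]) < 60;
        System.out.println(select(lines, ",", 8, where));
    }
}
```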
Purpose:
This method removes duplicates, either across whole records or on selected fields.
Method signature:
void dedupKerberos(String spStr, String fdNum, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path,String keytabPath, String principalPath)
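Returns:
Empty on success; a non-empty value is the error message.
Parameters:
spStr: field separator;
fdNum: fields to deduplicate on; 0 means the whole record. Format: 0, or a comma-separated list of field numbers (1,2,3...);
srcDirName: source directory;
dstDirName: output directory; if it already exists it is overwritten;
hostIp: IP address of the HiveServer host to connect to;
hostPort: HiveServer port (default 10000);
hostName: user name for the Hive connection, e.g. root;
hostPassword: password of the Hive user;
user: login user name of the service principal;
krb5Path: path to krb5.conf;
keytabPath: path to hive.keytab;
principalPath: principal of the Hive service,
e.g. principal=hive/[email protected]
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact information. To deduplicate the subject column of the student data, use dedupKerberos.
Program listing: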
import com.dksou.fitting.dataprocess.service.DedupeKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DedupeKerberosClient {
public static final String SERVER_IP = "localhost";
public static final int SERVER_PORT = 9872;
public static final int TIMEOUT = 300000; // connections time out easily; increase this value if needed
DedupeKerberosService.Client client;
TTransport tTransport;
@Before
public void Before() throws TTransportException {
tTransport=new TSocket( SERVER_IP,SERVER_PORT );
TProtocol tProtocol = new TBinaryProtocol(tTransport);
TMultiplexedProtocol dedupeKerberosService = new TMultiplexedProtocol(tProtocol, "DedupeKerberosService");
client=new DedupeKerberosService.Client( dedupeKerberosService);
tTransport.open();
}
@After
public void close(){
tTransport.close();
}
@Test
public void dedupKerberos() throws TException {
String spStr = ",";
String fdNum = "5"; // field 5 is the subject column
String srcDirName = "/test/in";
String dstDirName = "/test/out";
String hostIp = "192.168.60.129";
String hostPort = "10000";
String hostName = "root";
String hostPassword = "123456";
String user = "hive/[email protected]";
String keytabPath = "/home/hive.keytab";
String krb5Path = "/etc/krb5.conf";
String principalPath = "principal=hive/[email protected]";
System.out.print("---------------------");
client.dedupKerberos( spStr,fdNum,srcDirName,dstDirName,hostIp,hostPort,hostName,hostPassword,user, krb5Path,keytabPath,principalPath);
System.out.print("Done");
}
}
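The fdNum format (0, or a comma-separated list of 1-based field numbers) can be illustrated with a small local sketch. It assumes well-formed records and assumes the result is the list of distinct key values in first-seen order; the source does not say whether the service writes whole records or only the selected fields. DedupSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical local illustration of deduplication driven by fdNum. */
final class DedupSketch {
    private DedupSketch() {}

    /** Returns distinct keys in first-seen order; fdNum is "0" or a list such as "1,3". */
    static List<String> dedup(List<String> lines, String spStr, String fdNum) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            if (fdNum.trim().equals("0")) {
                seen.add(line); // 0 means the whole record is the key
                continue;
            }
            String[] f = line.split(Pattern.quote(spStr), -1);
            List<String> key = new ArrayList<>();
            for (String n : fdNum.split(",")) {
                key.add(f[Integer.parseInt(n.trim()) - 1]); // field numbers are 1-based
            }
            seen.add(String.join(spStr, key));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "1,2,Li Lei,M,Chinese",
                "1,2,Han Mei,F,Math",
                "1,3,Lucy,F,Chinese");
        System.out.println(dedup(lines, ",", "5"));   // distinct subjects
        System.out.println(dedup(lines, ",", "1,2")); // distinct grade/class pairs
    }
}
```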
Purpose:
This method computes the maximum, minimum, sum or average of a field.
Method signature:
double countKerberos(String fun, int fdSum, String spStr, int fdNum, String dirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path,String keytabPath, String principalPath)
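Returns:
The computed result.
Parameters:
fun: function to apply: avg, min, max or sum;
fdSum: number of fields per record;
spStr: field separator;
fdNum: number of the field to aggregate;
dirName: directory name;
hostIp: IP address of the HiveServer host to connect to;
hostPort: HiveServer port (default 10000);
hostName: user name for the Hive connection, e.g. root;
hostPassword: password of the Hive user;
user: login user name of the service principal;
krb5Path: path to krb5.conf;
keytabPath: path to hive.keytab;
principalPath: principal of the Hive service,
e.g. principal=hive/[email protected]
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact information. To compute the average of all scores in the student data, use the avg function of countKerberos.
Program listing: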
import com.dksou.fitting.dataprocess.service.DataStaticKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataStaticKerberosClient {
public static final String SERVER_IP = "localhost";
public static final int SERVER_PORT = 8090;
public static final int TIMEOUT = 300000; // connections time out easily; increase this value if needed
TTransport tTransport;
DataStaticKerberosService.Client client;
@Before
public void Before() throws TTransportException {
tTransport=new TSocket( SERVER_IP,SERVER_PORT );
TProtocol tProtocol = new TBinaryProtocol(tTransport);
TMultiplexedProtocol dataStaticKerberosService = new TMultiplexedProtocol(tProtocol, "DataStaticKerberosService");
client = new DataStaticKerberosService.Client(dataStaticKerberosService);
tTransport.open();
}
@After
public void close(){
tTransport.close();
}
@Test
public void countKerberos() throws TException {
String hostIp = "192.168.60.129";
String UserName = "root";
String UserPassword = "123456";
// fun selects the aggregate: avg, min, max or sum
/**
* avg: average
* min: minimum
* max: maximum
* sum: total
*/
String fun = "avg";
int fdSum = 8;
String spStr =",";
int fdNum = 6;
String dirName ="/test/in";
String hostPort ="10000";
String user = "hive/[email protected]";
String keytabPath = "/home/hive.keytab";
String krb5Path = "/etc/krb5.conf";
String principalPath = "principal=hive/[email protected]";
double result = client.countKerberos(fun, fdSum, spStr, fdNum, dirName, hostIp, hostPort, UserName, UserPassword, user, krb5Path, keytabPath, principalPath);
System.out.println(fun + " = " + result);
}
}
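As a rough local picture of what the four fun values compute, the sketch below aggregates field fdNum over in-memory lines with DoubleSummaryStatistics. It is an illustration under assumptions, not the service's code: AggregateSketch is hypothetical, and skipping records whose field count differs from fdSum is assumed rather than documented.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical local illustration of avg/min/max/sum over one field. */
final class AggregateSketch {
    private AggregateSketch() {}

    static double aggregate(String fun, int fdSum, String spStr, int fdNum, List<String> lines) {
        DoubleSummaryStatistics stats = lines.stream()
                .map(line -> line.split(Pattern.quote(spStr), -1))
                .filter(f -> f.length == fdSum)                      // assumption: skip malformed records
                .mapToDouble(f -> Double.parseDouble(f[fdNum - 1]))  // fdNum is 1-based
                .summaryStatistics();
        switch (fun) {
            case "avg": return stats.getAverage();
            case "min": return stats.getMin();
            case "max": return stats.getMax();
            case "sum": return stats.getSum();
            default: throw new IllegalArgumentException("unknown fun: " + fun);
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "1,2,Li Lei,M,Chinese,55,Li Ming,13800000000",
                "1,2,Han Mei,F,Chinese,82,Han Gang,13900000000",
                "1,2,Lucy,F,Math,40,Lily,13700000000");
        System.out.println(aggregate("avg", 8, ",", 6, lines));
    }
}
```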
Purpose:
This method counts the records in which a field satisfies a condition.
Method signature:
double countRecordKerberos(String fun, int fdSum, String spStr, int fdNum, String compStr, String whereStr, String dirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path,String keytabPath, String principalPath)
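Returns:
The number of matching records.
Parameters:
fun: function to apply: count;
fdSum: number of fields per record;
spStr: field separator;
fdNum: number of the field to test;
compStr: comparison operator, one of >, <, >=, <=, =, !=; usage: "'>='";
whereStr: value to compare against;
dirName: directory name;
hostIp: IP address of the HiveServer host to connect to;
hostPort: HiveServer port (default 10000);
hostName: user name for the Hive connection, e.g. root;
hostPassword: password of the Hive user;
user: login user name of the service principal;
krb5Path: path to krb5.conf;
keytabPath: path to hive.keytab;
principalPath: principal of the Hive service,
e.g. principal=hive/[email protected]
Example
The data has 8 comma-separated columns: 1 grade, 2 class, 3 name, 4 gender, 5 subject, 6 score, 7 parent name, 8 contact information. To count how many students are in grade one, use countRecordKerberos.
Program listing: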
import com.dksou.fitting.dataprocess.service.DataStaticKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataStaticKerberosClient {
public static final String SERVER_IP = "localhost";
public static final int SERVER_PORT = 8090;
public static final int TIMEOUT = 300000; // connections time out easily; increase this value if needed
TTransport tTransport;
DataStaticKerberosService.Client client;
@Before
public void Before() throws TTransportException {
tTransport=new TSocket( SERVER_IP,SERVER_PORT );
TProtocol tProtocol = new TBinaryProtocol(tTransport);
TMultiplexedProtocol dataStaticKerberosService = new TMultiplexedProtocol(tProtocol, "DataStaticKerberosService");
client = new DataStaticKerberosService.Client(dataStaticKerberosService);
tTransport.open();
}
@After
public void close(){
tTransport.close();
}
@Test
public void countRecordKerberos() throws TException {
// function to apply: count
String fun = "count";
/**
* field number, comparison operator, value
*/
// field number
int fdNum = 3;
// comparison operator
String compStr = "'='";
// value to compare against
String whereStr = "一年级";
String dirName = "/datas/card/data/";
String hostIp = "192.168.60.129";
String hostName = "root";
String hostPassword = "123456";
int fdSum = 5;
String spStr = ",";
String hostPort = "10000";
String user = "hive/[email protected]";
String keytabPath = "/home/hive.keytab";
String krb5Path = "/etc/krb5.conf";
String principalPath = "principal=hive/[email protected]";
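// count the records whose field satisfies the condition and print the result to the console
double result = client.countRecordKerberos(fun, fdSum, spStr, fdNum, compStr, whereStr, dirName, hostIp, hostPort, hostName, hostPassword, user, krb5Path, keytabPath, principalPath);
System.out.println(result);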
}
}
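The interplay of fdNum, compStr and whereStr can be sketched locally as follows. The sketch strips the single quotes from a compStr such as "'>='" and counts matching records. How the service actually compares values is not stated in the source; comparing numerically when both sides parse as numbers and as strings otherwise is an assumption, and CountRecordSketch is a hypothetical name.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical local illustration of counting records by a field condition. */
final class CountRecordSketch {
    private CountRecordSketch() {}

    static boolean matches(String value, String compStr, String whereStr) {
        String op = compStr.replace("'", "").trim(); // "'>='" becomes ">="
        int c;
        try {
            c = Double.compare(Double.parseDouble(value), Double.parseDouble(whereStr));
        } catch (NumberFormatException e) {
            c = value.compareTo(whereStr); // assumption: fall back to string comparison
        }
        switch (op) {
            case ">": return c > 0;
            case "<": return c < 0;
            case ">=": return c >= 0;
            case "<=": return c <= 0;
            case "=": return c == 0;
            case "!=": return c != 0;
            default: throw new IllegalArgumentException("unknown operator: " + compStr);
        }
    }

    static long countRecord(List<String> lines, String spStr, int fdSum, int fdNum,
                            String compStr, String whereStr) {
        return lines.stream()
                .map(line -> line.split(Pattern.quote(spStr), -1))
                .filter(f -> f.length == fdSum && matches(f[fdNum - 1], compStr, whereStr))
                .count();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1,2,Li Lei", "1,3,Han Mei", "2,1,Lucy");
        System.out.println(countRecord(lines, ",", 3, 1, "'='", "1"));
    }
}
```

The sketch returns a long, whereas the documented method returns a double.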
Purpose:
This method filters records with whereStr and groups them by groupStr for analysis.
Method signature:
void analyseKerberos(String spStr, int fdSum, String whereStr, String groupStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path,String keytabPath, String principalPath)
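Returns:
Empty on success; a non-empty value is the error message.
Parameters:
spStr: field separator
fdSum: number of fields per record
whereStr: filter condition, e.g. "\"f1='T100'\""; if there is none, pass 1=1
groupStr: grouping condition, e.g. "f1"; if there is none, pass 1
srcDirName: directory containing the input files
dstDirName: directory where the result data is stored
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port (default 10000)
hostName: user name for the Hive connection, e.g. root
hostPassword: password of the Hive user
user: login user name of the service principal
krb5Path: path to krb5.conf
keytabPath: path to hive.keytab
principalPath: principal of the Hive service,
e.g. principal=hive/[email protected]
Program listing: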
import com.dksou.fitting.dataprocess.service.DataCleanKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataCleanKerberosClient {
public static final String SERVER_IP = "localhost";
public static final int SERVER_PORT = 9872;
public static final int TIMEOUT = 300000; // connections time out easily; increase this value if needed
TTransport tTransport;
DataCleanKerberosService.Client client;
@Before
public void Before() throws TTransportException {
tTransport=new TSocket( SERVER_IP,SERVER_PORT );
TProtocol tProtocol = new TBinaryProtocol(tTransport);
TMultiplexedProtocol dataCleanKerberosService = new TMultiplexedProtocol(tProtocol, "DataCleanKerberosService");
client = new DataCleanKerberosService.Client(dataCleanKerberosService);
tTransport.open();
}
@After
public void close(){
tTransport.close();
}
@Test
public void analyseKerberos() throws TException {
String spStr = ",";
int fdSum = 8;
String whereStr = "1=1";
String groupStr = "1";
String srcDirName = "/test/in";
String dstDirName= "/test/out";
String hostIp ="192.168.60.129";
String hostPort = "10000";
String hostName = "root";
String hostPassword = "123456";
String user = "hive/[email protected]";
String keytabPath = "/home/hive.keytab";
String krb5Path = "/etc/krb5.conf";
String principalPath = "principal=hive/[email protected]";
System.out.print("---------------------");
client.analyseKerberos(spStr, fdSum, whereStr, groupStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword,user, krb5Path,keytabPath
,principalPath);
System.out.print("Done");
}
}
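The source does not say what analyseKerberos writes for each group. Purely as an illustration of "filter with whereStr, then group by groupStr", the sketch below counts the records per group locally. The per-group count, the name GroupSketch and the sample records are all assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical local illustration of filter-then-group, using a record count per group. */
final class GroupSketch {
    private GroupSketch() {}

    static Map<String, Long> countByField(List<String> lines, String spStr, int fdSum,
                                          Predicate<String[]> where, int groupField) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(Pattern.quote(spStr), -1);
            if (f.length != fdSum || !where.test(f)) {
                continue; // skip malformed or filtered-out records
            }
            counts.merge(f[groupField - 1], 1L, Long::sum); // groupField is 1-based
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("T100,a", "T100,b", "T200,c", "bad");
        // f -> true plays the role of whereStr = "1=1"; grouping on field 1 mirrors groupStr = "f1".
        System.out.println(countByField(lines, ",", 2, f -> true, 1));
    }
}
```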
Purpose:
This method analyses how frequently two items occur together.
Method signature:
void apriori2Kerberos(String spStr, int fdSum, String pNum, String oNum, String whereStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path, String keytabPath, String principalPath)
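Returns:
Empty on success; a non-empty value is the error message.
Parameters:
spStr: field separator
fdSum: number of fields per record
pNum: field that holds the items to analyse
oNum: field that holds the order number (or a similar grouping key)
whereStr: filter condition, e.g. "\"f1='T100'\""; if there is none, pass 1=1
srcDirName: directory containing the input files
dstDirName: directory where the result data is stored
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port (default 10000)
hostName: user name for the Hive connection, e.g. root
hostPassword: password of the Hive user
user: login user name of the service principal
krb5Path: path to krb5.conf
keytabPath: path to hive.keytab
principalPath: principal of the Hive service,
e.g. principal=hive/[email protected]
Example:
Given product order data, analyse how often two products are bought together. f1 is the order-number field and f2 is the product field, for example:
Order number  Product
1  milk
1  bread
1  beer
2  milk
2  juice
Program listing: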
import com.dksou.fitting.dataprocess.service.DataCleanKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataCleanKerberosClient {
public static final String SERVER_IP = "localhost";
public static final int SERVER_PORT = 9872;
public static final int TIMEOUT = 300000; // connections time out easily; increase this value if needed
TTransport tTransport;
DataCleanKerberosService.Client client;
@Before
public void Before() throws TTransportException {
tTransport=new TSocket( SERVER_IP,SERVER_PORT );
TProtocol tProtocol = new TBinaryProtocol(tTransport);
TMultiplexedProtocol dataCleanKerberosService = new TMultiplexedProtocol(tProtocol, "DataCleanKerberosService");
client = new DataCleanKerberosService.Client(dataCleanKerberosService);
tTransport.open();
}
@After
public void close(){
tTransport.close();
}
@Test
public void apriori2Kerberos() throws TException {
String spStr = ",";
int fdSum = 2;
String pNum = "f2";
String oNum = "f1";
String whereStr = "1=1";
String srcDirName = "/test/into";
String dstDirName= "/test/out";
String hostIp ="192.168.60.129";
String hostPort = "10000";
String hostName = "root";
String hostPassword = "123456";
String user = "hive/[email protected]";
String keytabPath = "/home/hive.keytab";
String krb5Path = "/etc/krb5.conf";
String principalPath = "principal=hive/[email protected]";
System.out.print("---------------------");
client.apriori2Kerberos(spStr, fdSum, pNum, oNum, whereStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword,user, krb5Path,keytabPath
,principalPath);
System.out.print("Done");
}
}
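For the order data in the example above, the idea of pairwise co-occurrence can be sketched locally: collect the items of each order, count every unordered pair once per order, and divide by the number of orders. This is a minimal illustration, not the service's algorithm; PairFrequencySketch is hypothetical, it takes 1-based field numbers as ints where the real call takes strings such as "f1", and defining frequency as pair count divided by order count is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Hypothetical local illustration of two-item co-occurrence frequency. */
final class PairFrequencySketch {
    private PairFrequencySketch() {}

    /** Maps "itemA|itemB" to the fraction of orders that contain both items. */
    static Map<String, Double> pairFrequency(List<String> lines, String spStr,
                                             int oNum, int pNum) {
        // Distinct items per order, sorted so that each pair has a single spelling.
        Map<String, TreeSet<String>> orders = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(Pattern.quote(spStr), -1);
            orders.computeIfAbsent(f[oNum - 1], k -> new TreeSet<>()).add(f[pNum - 1]);
        }
        Map<String, Integer> pairCounts = new TreeMap<>();
        for (TreeSet<String> items : orders.values()) {
            List<String> sorted = new ArrayList<>(items);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairCounts.merge(sorted.get(i) + "|" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        Map<String, Double> freq = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) orders.size());
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1,milk", "1,bread", "1,beer", "2,milk", "2,juice");
        System.out.println(pairFrequency(lines, ",", 1, 2));
    }
}
```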
Purpose:
This method analyses how frequently three items occur together.
Method signature:
void apriori3Kerberos(String spStr, int fdSum, String pNum, String oNum, String whereStr, String srcDirName, String dstDirName, String hostIp, String hostPort, String hostName, String hostPassword, String user, String krb5Path,String keytabPath, String principalPath)
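Returns:
Empty on success; a non-empty value is the error message.
Parameters:
spStr: field separator
fdSum: number of fields per record
pNum: field that holds the items to analyse
oNum: field that holds the order number (or a similar grouping key)
whereStr: filter condition, e.g. "\"f1='T100'\""; if there is none, pass 1=1
srcDirName: directory containing the input files
dstDirName: directory where the result data is stored
hostIp: IP address of the HiveServer host to connect to
hostPort: HiveServer port (default 10000)
hostName: user name for the Hive connection, e.g. root
hostPassword: password of the Hive user
user: login user name of the service principal
krb5Path: path to krb5.conf
keytabPath: path to hive.keytab
principalPath: principal of the Hive service,
e.g. principal=hive/[email protected]
Example:
Given product order data, analyse how often three products are bought together. f1 is the order-number field and f2 is the product field.
Program listing: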
import com.dksou.fitting.dataprocess.service.DataCleanKerberosService;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TMultiplexedProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DataCleanKerberosClient {
public static final String SERVER_IP = "localhost";
public static final int SERVER_PORT = 9872;
public static final int TIMEOUT = 300000; // connections time out easily; increase this value if needed
TTransport tTransport;
DataCleanKerberosService.Client client;
@Before
public void Before() throws TTransportException {
tTransport=new TSocket( SERVER_IP,SERVER_PORT );
TProtocol tProtocol = new TBinaryProtocol(tTransport);
TMultiplexedProtocol dataCleanKerberosService = new TMultiplexedProtocol(tProtocol, "DataCleanKerberosService");
client = new DataCleanKerberosService.Client(dataCleanKerberosService);
tTransport.open();
}
@After
public void close(){
tTransport.close();
}
@Test
public void apriori3Kerberos() throws TException {
String spStr = ",";
int fdSum = 2;
String pNum = "f2";
String oNum = "f1";
String whereStr = "1=1";
String srcDirName = "/test/into";
String dstDirName= "/test/out";
String hostIp ="192.168.60.129";
String hostPort = "10000";
String hostName = "root";
String hostPassword = "123456";
String user = "hive/[email protected]";
String keytabPath = "/home/hive.keytab";
String krb5Path = "/etc/krb5.conf";
String principalPath = "principal=hive/[email protected]";
System.out.print("---------------------");
client.apriori3Kerberos(spStr, fdSum, pNum, oNum, whereStr, srcDirName, dstDirName, hostIp, hostPort, hostName, hostPassword,user, krb5Path,keytabPath
,principalPath);
System.out.print("Done");
}
}
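Counting three items together generalises to counting item sets of any size k. The sketch below enumerates the size-k combinations of each order's items and counts in how many orders each combination appears; with k = 3 it mirrors the three-item case above. It is a plain enumeration for illustration, not the service's implementation, and ItemsetSketch and the sample orders are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical local illustration of counting k-item sets per order. */
final class ItemsetSketch {
    private ItemsetSketch() {}

    /** Appends every size-k combination of sortedItems (joined by '|') to out. */
    static void combinations(List<String> sortedItems, int k, int start,
                             List<String> current, List<String> out) {
        if (current.size() == k) {
            out.add(String.join("|", current));
            return;
        }
        for (int i = start; i < sortedItems.size(); i++) {
            current.add(sortedItems.get(i));
            combinations(sortedItems, k, i + 1, current, out);
            current.remove(current.size() - 1); // backtrack
        }
    }

    /** Counts in how many orders each size-k item set appears. */
    static Map<String, Integer> itemsetCounts(List<List<String>> orders, int k) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> order : orders) {
            // Sort and deduplicate so each item set has a single spelling.
            List<String> sorted = new ArrayList<>(new TreeSet<>(order));
            List<String> combos = new ArrayList<>();
            combinations(sorted, k, 0, new ArrayList<>(), combos);
            for (String combo : combos) {
                counts.merge(combo, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> orders = List.of(
                List.of("milk", "bread", "beer"),
                List.of("milk", "bread", "beer", "juice"),
                List.of("milk", "juice"));
        System.out.println(itemsetCounts(orders, 3));
    }
}
```

Dividing each count by the number of orders gives a frequency in the same sense as for pairs.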
Official website: http://www.dksou.com