Web crawler in Java: mind map

Page reading and page analysis

originalPageGetter class: reads a page from the original raws files according to the input parameters

Member variables

String url="";

String urlFromHead=""

DBConnection dbc = new DBConnection()

MD5 md5 = new MD5()

Configuration conf = new Configuration()

Member functions

getPage()

Local variables

String content = ""

Calls

Page page = getRawsInfo(url)

page.getRawName()

readRawHead()

content = readRawContent(bfReader);

getContent(String file, int offset): returns the page content for the given file name and offset

Calls

Resolves the file path from the file name file, then opens the file and wraps it in bfReader

readRawHead(character stream of the file)

readRawContent(character stream of the file)

Returns a String

content
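
The getContent steps above can be sketched as follows. The mind map does not spell out the record layout, so this sketch assumes that each record starts at a character offset, that its header is one field per line ending with a blank line, and that a length: field gives the number of characters of content. The class name RawContentSketch and the reader-based signature are illustrative, not the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;

class RawContentSketch {
    // Skips to the record at `offset` (counted in characters, an assumption),
    // reads its header, then returns exactly `length` characters of content.
    static String getContent(BufferedReader bfReader, long offset) throws IOException {
        long skipped = 0;
        while (skipped < offset) {
            long n = bfReader.skip(offset - skipped);
            if (n <= 0) {
                throw new IOException("offset is beyond the end of the file");
            }
            skipped += n;
        }
        int length = readRawHead(bfReader);
        return readRawContent(bfReader, length);
    }

    // Reads header lines up to the first blank line and returns the length field.
    static int readRawHead(BufferedReader bfReader) throws IOException {
        int length = -1;
        String line;
        while ((line = bfReader.readLine()) != null && !line.isEmpty()) {
            if (line.startsWith("length:")) {
                length = Integer.parseInt(line.substring("length:".length()).trim());
            }
        }
        if (length < 0) {
            throw new IOException("record header has no length field");
        }
        return length;
    }

    // Reads up to `length` characters of page content.
    static String readRawContent(BufferedReader bfReader, int length) throws IOException {
        char[] buffer = new char[length];
        int read = 0;
        while (read < length) {
            int n = bfReader.read(buffer, read, length - read);
            if (n < 0) {
                break; // file ended early; return what was read
            }
            read += n;
        }
        return new String(buffer, 0, read);
    }
}
```

To read from a file on disk, the reader could be opened with new BufferedReader(new FileReader(path)) before calling getContent.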

readRawHead(BufferedReader bfReader)

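readRawContent(BufferedReader bfReader): returns the page content

getRawsInfo(String url): queries the pageindex table for the URL and returns the matching record as a Page

sql = select * from pageindex where url='url'

ResultSet rs = dbc.executeQuery(sql);

Loop over the result set: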

connent = rs.getString("connent");

offset = Integer.parseInt(rs.getString("offset"));

raws = rs.getString("raws")

return new Page(url, offset, connent, raws);

RawsAnalyzer class: analyzes the collection of original pages in Raws

Member variables

DBConnection dbc = new DBConnection();

MD5 md5 = new MD5();

int offset;

Page page;

String rootDirectory;

Member functions

Constructor with a parameter: RawsAnalyzer(String rootName)

this.rootDirectory = rootName;

page = new Page();

createPageIndex()

ArrayList<String> fileNames = getSubFile(rootDirectory);

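Loop: call createPageIndex(fileName) for each file name

createPageIndex(String fileName): overloaded version

fileReader = new FileReader(fileName); // opens the file

bfReader = new BufferedReader(fileReader); // buffered character stream over the file

Loop, reading the file line by line: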

url = readRawHead(bfReader);

content = readRawContent(bfReader);

contentMD5 = md5.getMD5ofStr(content);

page.setPage(url, oldOffset, contentMD5, fileName);

page.add2DB(dbc);

readRawHead(BufferedReader bfReader)

Returns urlLine

readRawContent(BufferedReader bfReader)

static ArrayList<String> getSubFile(String fileName): returns the absolute paths of the files under fileName

main()

RawsAnalyzer analyzer = new RawsAnalyzer("Raws");

analyzer.createPageIndex();

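Utilities

DBConnection class: database connection

Load the driver: Class.forName()

Connect to the database: DriverManager.getConnection()

Access the database: conn.createStatement()

Run a query or an update: executeQuery(), executeUpdate()

HtmlParser class: page processing

Member functions

html2Text(String inputString): the parameter is a string containing HTML tags

Java regular expressions are implemented by the Pattern and Matcher classes in the java.util.regex package

Pattern compiles a regular expression: Pattern.compile(String regex)

Matcher supports capture groups and repeated matching against the input

Purpose: strips the tags and returns the plain-text string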
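
A minimal sketch of what html2Text might look like with Pattern and Matcher. The mind map does not list the actual regular expressions, so the four patterns below (script blocks, style blocks, remaining tags, whitespace) are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class Html2TextSketch {
    // Assumed patterns; the original expressions are not shown in the mind map.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*?>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*?>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    // Removes script and style blocks first, then every remaining tag,
    // and finally collapses runs of whitespace into single spaces.
    static String html2Text(String inputString) {
        String text = SCRIPT.matcher(inputString).replaceAll(" ");
        text = STYLE.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return SPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Script and style blocks are removed before the generic tag pattern so that their bodies do not leak into the text.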

ArrayList<URL> urlDetector(String htmlDoc)

Purpose: drops useless links and repairs links given as relative paths

Extracts the URL between the double quotes in href="http://bbs.life.sina.com.cn/" and stores it in the list
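
The link extraction described above could be sketched like this. The original urlDetector takes only htmlDoc; the extra base parameter is an assumption added here because a relative link cannot be repaired without knowing the page it came from. The rules for what counts as a useless link (empty, fragment-only, or not http/https) are also assumptions.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlDetectorSketch {
    // Captures the text between the double quotes of an href attribute.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static ArrayList<URL> urlDetector(URI base, String htmlDoc) {
        ArrayList<URL> urls = new ArrayList<>();
        Set<String> seen = new HashSet<>(); // de-duplicate by string, not URL.equals
        Matcher matcher = HREF.matcher(htmlDoc);
        while (matcher.find()) {
            String link = matcher.group(1).trim();
            if (link.isEmpty() || link.startsWith("#")) {
                continue; // empty or in-page anchor
            }
            try {
                URI resolved = base.resolve(link); // repairs relative paths
                String scheme = resolved.getScheme();
                if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
                    continue; // javascript:, mailto:, and similar
                }
                if (seen.add(resolved.toString())) {
                    urls.add(resolved.toURL());
                }
            } catch (IllegalArgumentException | MalformedURLException e) {
                // Malformed link: skip it rather than abort the whole page.
            }
        }
        return urls;
    }
}
```

Duplicates are tracked as strings because URL.equals may resolve host names, which is slow inside a crawler loop.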

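MD5 class: message digest (hashing)

Message-Digest Algorithm 5

One of the widely used hash algorithms for checking that transmitted data arrives complete and unchanged

Maps data (such as a piece of text) to a fixed-length value (128 bits for MD5)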
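
The project uses its own MD5 class with a getMD5ofStr method. As an illustration only, the same digest can be computed with the JDK's java.security.MessageDigest; the UTF-8 encoding of the input and the lowercase hex output are assumptions, since the original class's conventions are not shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Sketch {
    // Returns the MD5 digest of the input as 32 lowercase hex characters.
    static String getMD5ofStr(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

In this design the digest serves as a compact fingerprint of the page content, which is what RawsAnalyzer stores in the index instead of the content itself.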

Page class: stores a page's index record in the database

Member variables: the stored page format

String url

int offset

String connent

String rawName

Main member functions

add2DB(DBConnection dbc)

insert into pageindex(url, connent, offset, raws) values ('url', 'connent', 'offset', 'rawName');

Result class

Member variables

private String title; private String content; private String url; private String date;

Member functions

Constructor: Result(String title, String content, String url, String date)

this.title = title; this.content = content; this.url = url; this.date = date;

ResultGenerator class

Member functions

ResultGenerator(): constructor

pageGetter = new originalPageGetter()

String regEx_meta = "<meta[\\s\\S]*?>"; // regular expression for meta tags

p_title = Pattern.compile(regEx_title,Pattern.CASE_INSENSITIVE); p_meta = Pattern.compile(regEx_meta,Pattern.CASE_INSENSITIVE); // compiled patterns used for matching

Result generateResult(String url)

Page page; String content = ""; String date = ""; String title = ""; String shortContent = "";

page = pageGetter.getRawsInfo(url);

content = pageGetter.getContent(page.getRawName(), page.getOffset());

date = pageGetter.getDate();

return new Result(title, shortContent, url, date);

StopWordsMerger class

Member variables

Member functions

HashSet<String> scanDict(String stopDictFile): reads the stop words into a hash set

mergeSet(HashSet<String> set1, HashSet<String> set2): merges two stop-word hash sets

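Crawler

Dispatcher class: adds new URLs to the appropriate list

Member variables

ArrayList<URL> urls

ArrayList<URL> visitedURLs: already visited

ArrayList<URL> unvisitedURLs: not yet visited

Main member functions (each one is individually locked)

insert(a single URL / a list of URLs): inserts new URLs

if (!urls.contains(url) && !visitedURLs.contains(url)) urls.add(url);

getURL(): takes the next URL

While urls.isEmpty(): wait

url = urls.get(0): takes the first entry of the list

visitedURLs.add(url); // marks it as visited

urls.remove(url); // removes it from the pending list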
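
The locking and waiting behaviour described for Dispatcher could be sketched with synchronized methods plus wait/notifyAll. This is an illustration under assumptions, not the original class: it stores java.net.URI instead of URL (URL.equals may trigger DNS lookups inside contains), and the notifyAll call in insert is an assumption about how waiting Gather threads are woken up.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class DispatcherSketch {
    private final ArrayList<URI> urls = new ArrayList<>();        // pending
    private final ArrayList<URI> visitedURLs = new ArrayList<>(); // already handed out

    DispatcherSketch(List<URI> seeds) {
        insert(seeds);
    }

    // Adds a URL unless it is already pending or already visited.
    synchronized void insert(URI url) {
        if (!urls.contains(url) && !visitedURLs.contains(url)) {
            urls.add(url);
            notifyAll(); // wake up threads blocked in getURL()
        }
    }

    // Adds a batch of URLs; the lock is reentrant, so calling insert(URI) is safe.
    synchronized void insert(List<URI> newUrls) {
        for (URI url : newUrls) {
            insert(url);
        }
    }

    // Blocks while nothing is pending, then moves the first URL to the visited list.
    synchronized URI getURL() throws InterruptedException {
        while (urls.isEmpty()) {
            wait();
        }
        URI url = urls.remove(0);
        visitedURLs.add(url);
        return url;
    }
}
```

With ArrayList, contains is a linear scan; for a large crawl a HashSet for the visited set and a queue for the pending URLs would be the usual substitution.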

Gather class implements Runnable; implementing run() is what lets it run on multiple threads

Member variables

Dispatcher disp

String ID

URLClient client = new URLClient()

WebAnalyzer analyzer = new WebAnalyzer()

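Main member functions

Gather(String ID, Dispatcher disp): constructor with parameters

Creates a new file RAW__ID in the Raws folder

Writes to that file

run()

Loop

url = disp.getURL()

htmlDoc = client.getDocumentAt(url): holds the content of the page at url

If htmlDoc is not empty:

ArrayList<URL> newURL = analyzer.doAnalyzer(bfWriter, url, htmlDoc); // parses the page, saves the doc in the specified format, and returns the URLs found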

disp.insert(newURL)

URLClient class

Member functions

getDocumentAt(URL url): fetches the page content for the given address

URL hostURL = url;

URLConnection conn = hostURL.openConnection(); // creates the connection object

reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));

Loop: while (line = reader.readLine()) != null, document.append(line + "\n");
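
Put together, getDocumentAt might look like the sketch below. The timeouts, the UTF-8 decoding, and returning an empty string on failure are assumptions added for the example; the mind map only shows the connection, the reader, and the read loop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class UrlClientSketch {
    // Downloads the document at `url` and returns it with "\n" line endings.
    // Returns an empty string if the page cannot be fetched (assumed convention).
    static String getDocumentAt(URL url) {
        StringBuilder document = new StringBuilder();
        try {
            URLConnection conn = url.openConnection();
            conn.setConnectTimeout(5000); // assumed values, in milliseconds
            conn.setReadTimeout(5000);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    document.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            return "";
        }
        return document.toString();
    }
}
```

Real pages declare many different encodings, so a fuller version would take the charset from the Content-Type header or a meta tag instead of assuming UTF-8.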

WebAnalyzer class

Member functions

ArrayList<URL> doAnalyzer(BufferedWriter bfWriter, URL url, String htmlDoc)

ArrayList<URL> urlInHtmlDoc = (new HtmlParser()).urlDetector(htmlDoc); // finds the other URLs contained in the page content htmlDoc and stores them in the list urlInHtmlDoc

saveDoc(bfWriter, url, htmlDoc): writes htmlDoc, once its links have been extracted, to the file in the specified format

saveDoc(BufferedWriter bfWriter, URL url, String htmlDoc)

Format of a record in the file

version:1.0 url:http://www.163.com date:Thu Mar 12 20:55:55 CST 2015 IP:111.202.57.27 length:625461 page content
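
A sketch of how saveDoc could write one record in that format. The layout is an assumption: one header field per line, a blank line, then the page content, with length counted in characters. The original signature takes a URL and presumably looks up the IP and the current date itself; here ip and date are passed in as parameters to keep the example free of network calls.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Date;

class SaveDocSketch {
    // Writes one record: five header fields, a blank line, then the page content.
    static void saveDoc(BufferedWriter bfWriter, String url, String ip, Date date, String htmlDoc)
            throws IOException {
        bfWriter.write("version:1.0\n");
        bfWriter.write("url:" + url + "\n");
        bfWriter.write("date:" + date + "\n"); // Date.toString(), e.g. Thu Mar 12 20:55:55 CST 2015
        bfWriter.write("IP:" + ip + "\n");
        bfWriter.write("length:" + htmlDoc.length() + "\n");
        bfWriter.write("\n");
        bfWriter.write(htmlDoc);
        bfWriter.write("\n");
        bfWriter.flush();
    }
}
```

Storing the length in the header lets a reader jump over the content without scanning it for a terminator.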

Spider class: the crawler itself

Member variables

ArrayList<URL> urls

int gatherNum = 5

Main member functions

start(): starts the threads

Dispatcher disp = new Dispatcher(urls);

Loop gatherNum times:

Thread gather = new Thread(new Gather(String.valueOf(i), disp)); // creates a thread

gather.start();

main()

ArrayList<URL> urls = new ArrayList<URL>();

urls.add(new URL("http://www.baidu.com")); // adds a seed URL to the list

Spider spider = new Spider(urls);

spider.start();

Configuration class: reads the paths of other files from the configuration file

Member variables

propertie: a Properties object

inputFile: a FileInputStream object

outputFile: a FileOutputStream object

Member functions

Constructor: Configuration()

propertie = new Properties(): creates an empty property list

Reads the file configure.properties

String getValue(String key)

Returns the value of the property named key

main()

RAWSPATH

DICTIONARYPATH

MYSQLLIBPATH
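
A minimal sketch of the Configuration idea built on java.util.Properties. The constructor here takes a Reader instead of opening configure.properties itself, which is a change made only so the example runs without a file on disk; returning an empty string for a missing key is likewise an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

class ConfigurationSketch {
    private final Properties propertie = new Properties();

    // Loads key=value pairs from the given source.
    ConfigurationSketch(Reader source) throws IOException {
        propertie.load(source);
    }

    // Returns the value stored under `key`, or "" if the key is absent.
    String getValue(String key) {
        return propertie.getProperty(key, "");
    }
}
```

With a real file, the source would be something like new FileReader("configure.properties"), and the keys listed above (RAWSPATH, DICTIONARYPATH, MYSQLLIBPATH) would be read with getValue.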

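Word segmentation

DictReader class: reads the local word-list file into a hash set for fast lookup

Member functions

scanDict(String dictFile): reads dictFile line by line and stores the words in the hash set dictionary

Constructor (empty)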
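
scanDict can be sketched as a simple line reader. The original takes a file name; this version takes a BufferedReader so it can be tried without a dictionary file, and it assumes one word per line with blank lines ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashSet;

class DictReaderSketch {
    // Reads one word per line into a hash set, skipping blank lines.
    static HashSet<String> scanDict(BufferedReader reader) throws IOException {
        HashSet<String> dictionary = new HashSet<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String word = line.trim();
            if (!word.isEmpty()) {
                dictionary.add(word);
            }
        }
        return dictionary;
    }
}
```

The same routine would serve for the stop-word list, since both files are plain word lists.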

DictSegment class: splits a page into words so that keywords can be extracted

Member variables

HashSet<String> dict

HashSet<String> stopWordDict

DictReader dictReader

static final int maxLength = 4

static String dictFile = ""

static String stopDictFile = ""

Configuration conf

Member functions

Constructor: DictSegment()

new Configuration()

dictFile="wordlist.txt"

stopDictFile="stopWord.txt"

dict = dictReader.scanDict(dictFile);

stopWordDict = dictReader.scanDict(stopDictFile);

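SegmentFile(String htmlDoc): preprocesses htmlDoc

Step 1: run the HTML through regular expressions to strip tags and other noise, keeping only the text

Creates an HtmlParser object parser

htmlText = parser.html2Text(htmlDoc)

Step 2: split the text into sentences with cutIntoSentance, pass each sentence to cutIntoWord, and collect the return values

cutIntoSentance(htmlText)

Loop: cutIntoWord(sentances.get(i))

ArrayList<String> segResult

cutIntoSentance(String htmlDoc): splits htmlDoc using 。，、；：？！“”‘’《》（）- as delimiters

Creates a StringTokenizer object tokenizer over the string with those delimiters

ArrayList<String> sentance: holds the strings to return

cutIntoWord(String sentance): filters out stop words and single characters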
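
The two cutting steps can be sketched as below. cutIntoSentance follows the description directly. For cutIntoWord the mind map only says that stop words and single characters are filtered; the forward maximum matching against the dictionary (longest window first, up to maxLength = 4) is an assumption suggested by the maxLength field, not something the source confirms. The delimiters are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Set;
import java.util.StringTokenizer;

class DictSegmentSketch {
    static final int maxLength = 4;

    // Full-width period, comma, enumeration comma, semicolon, colon, question mark,
    // exclamation mark, curly quotes, book-title marks, full-width parentheses, hyphen.
    static final String DELIMITERS =
            "\u3002\uFF0C\u3001\uFF1B\uFF1A\uFF1F\uFF01\u201C\u201D\u2018\u2019\u300A\u300B\uFF08\uFF09-";

    // Splits the text wherever any delimiter character occurs.
    static ArrayList<String> cutIntoSentance(String htmlText) {
        ArrayList<String> sentance = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(htmlText, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            sentance.add(tokenizer.nextToken());
        }
        return sentance;
    }

    // Forward maximum matching (assumed): at each position try the longest window
    // first and shrink it until it is a dictionary word or a single character.
    static ArrayList<String> cutIntoWord(String sentance, Set<String> dict, Set<String> stopWordDict) {
        ArrayList<String> words = new ArrayList<>();
        int start = 0;
        while (start < sentance.length()) {
            int end = Math.min(start + maxLength, sentance.length());
            while (end > start + 1 && !dict.contains(sentance.substring(start, end))) {
                end--;
            }
            String word = sentance.substring(start, end);
            // Keep only multi-character dictionary words that are not stop words.
            if (word.length() > 1 && !stopWordDict.contains(word)) {
                words.add(word);
            }
            start = end;
        }
        return words;
    }
}
```

Greedy longest-first matching is simple and fast but can mis-segment ambiguous text; it is shown here only because it fits the fields listed for DictSegment.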

main()

DictSegment dictSeg = new DictSegment();

Building the url-words index

ForwardIndex class: builds the url-words index

Member variables

DBConnection dbc = new DBConnection();

HashMap<String, ArrayList<String>> indexMap = new HashMap<String, ArrayList<String>>();

originalPageGetter pageGetter = new originalPageGetter()

DictSegment dictSeg = new DictSegment()

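Member functions

Constructor (empty)

createForwardIndex()

Local variables

ArrayList<String> segResult: the words extracted from a page

sql = "select * from pageindex": the query

Calls

ResultSet rs = dbc.executeQuery(sql): runs the query and stores the result in rs

rs.next(): advances to the next row and drives the loop; in each iteration:

rs.getString("url"), fileName = rs.getString("raws"), rs.getString("offset"): read the values by column name

htmlDoc = pageGetter.getContent(fileName, offset): fetches the page content from the file name and offset

segResult = dictSeg.SegmentFile(htmlDoc): extracts the words from the page content

indexMap.put(url, segResult): maps the url to its words in indexMap

main()

Iterator: walks the HashMap and prints its entries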

InvertedIndex class: builds the inverted index of pages, mapping each word to its urls

Member variables

HashMap<String, ArrayList<String>> fordwardIndexMap;

HashMap<String, ArrayList<String>> invertedIndexMap;

Member functions

InvertedIndex(): constructor

fordwardIndexMap = forwardIndex.createForwardIndex();

createInvertedIndex()

invertedIndexMap = new HashMap<String, ArrayList<String>>()

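for loop: walks the forward index and inverts it

Take each url-words entry

Walk its words

If the inverted index does not contain the word yet: invertedIndexMap.put(word, urls);

If the index already contains the word, there is no need to add it again; look the word up and append the url to its list: urls.add(url);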
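
The inversion loop above can be written compactly with computeIfAbsent, which covers both branches (word not yet present, word already present). The check that avoids listing the same url twice for a word is an addition for the example; the mind map does not say how repeated words within one page are handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

class InvertedIndexSketch {
    // Turns url -> words into word -> urls.
    static HashMap<String, ArrayList<String>> createInvertedIndex(
            Map<String, ArrayList<String>> forwardIndexMap) {
        HashMap<String, ArrayList<String>> invertedIndexMap = new HashMap<>();
        for (Map.Entry<String, ArrayList<String>> entry : forwardIndexMap.entrySet()) {
            String url = entry.getKey();
            for (String word : entry.getValue()) {
                // Creates the url list on first sight of the word, otherwise reuses it.
                ArrayList<String> urls =
                        invertedIndexMap.computeIfAbsent(word, w -> new ArrayList<>());
                if (!urls.contains(url)) {
                    urls.add(url);
                }
            }
        }
        return invertedIndexMap;
    }
}
```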

getInvertedIndex()

return invertedIndexMap

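main()

Looks up the link addresses for a keyword

Query system

Response class

Member variables

HashMap<String, ArrayList<String>> invertedIndexMap; // inverted index

ArrayList<Result> results; // list of results to return

DictSegment dictSeg; // word segmenter

ResultGenerator resultGenerator;

Member functions

getResponse(String request)

Calls doQuery(request);

doQuery(String request)

1. Segment the query into words, drop the stop words, and look up the results for each word

2. Merge the results of the individual words and return the preliminary page URLs

3. For each URL, look up the page's location in the database and read the page content from Raws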
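
Steps 1 and 2 can be illustrated with the sketch below, which takes the already segmented query terms and merges their url lists. Merging is interpreted here as a union that keeps first-seen order; the mind map does not say whether the original uses a union, an intersection, or some ranking, so this is an assumption. Step 3 (fetching content through the database) is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

class QuerySketch {
    // Looks up each term in the inverted index and returns the union of the
    // matching urls, without duplicates, in the order they were first found.
    static ArrayList<String> doQuery(List<String> terms,
                                     Map<String, ArrayList<String>> invertedIndexMap) {
        LinkedHashSet<String> merged = new LinkedHashSet<>();
        for (String term : terms) {
            ArrayList<String> urls = invertedIndexMap.get(term);
            if (urls != null) {
                merged.addAll(urls);
            }
        }
        return new ArrayList<>(merged);
    }
}
```

A union favours recall; requiring every term to match (an intersection) would give fewer but more specific results.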