Below is a reposted script for running a recrawl (it could still be optimized, for example the argument handling and the backup steps), used here to compare how it differs from an ordinary crawl.
```bash
#!/bin/bash
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=5
adddays=5
topN=15   # Comment this statement if you don't want to set topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
${CATALINA_HOME}/bin/shutdown.sh

if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index

${CATALINA_HOME}/bin/startup.sh

echo "runbot: FINISHED: Crawl completed!"
echo ""
```
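For reference, a minimal way to invoke it (a sketch only: the NUTCH_HOME value and the assumption that the script is saved as bin/runbot are illustrative, not from the original post):

```bash
# assumed locations; adjust to your own installation
export NUTCH_HOME=/usr/local/nutch                 # hypothetical Nutch install path
export CATALINA_HOME=/opt/apache-tomcat-6.0.10     # Tomcat instance serving the Nutch webapp

cd "$NUTCH_HOME"
chmod +x bin/runbot
bin/runbot        # normal run: temporary crawl directories are deleted
bin/runbot safe   # safe mode: keeps BACKUPsegments / BACKUPindex(es) for recovery
```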
Differences introduced by the recrawl script:
* The adddays argument is passed to generate: since a URL's fetch time is updated at fetch time to its next scheduled fetch time, adddays shifts "now" forward by that many days when deciding which URLs are already due, so URLs that would come due within the next adddays days are recrawled as well. A negative value therefore narrows the selection compared with a positive one. (See the crawldb inspection sketch after this list.)
* An extra mergesegs job: it replaces data from older segments with the newer versions and merges the freshly generated segments into a single segment (or into several segments of a specified size). In my view this step is optional, and it should either be kept together with the merge (index) job mentioned below or removed along with it. (See the mergesegs sketch after this list.)
NOTE:
merge job (index merging): for a whole-web crawl this step should not be executed, because merging everything into a single index defeats the purpose of distributed search.
The script is already being modified to add validation; deletion handling is still missing. (My understanding: when an expired URL is refetched and turns out to be unavailable, it should be removed.)
* Although this is a recrawl, the index step is not a true incremental index: the index is regenerated from three inputs, the previous (merged) segments plus the new crawldb and linkdb. Consequently the index output directory (crawl/NEWindexes in the script) is not the old index directory but a brand-new index (not yet deduplicated, of course).
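To check which URLs the adddays window will actually pick up, the crawldb can be inspected before running the script. A minimal sketch, assuming the crawl/crawldb layout used above (output field names and dump file names can differ slightly between Nutch versions):

```bash
# overall crawldb statistics: URL counts by status, min/avg/max scores, etc.
$NUTCH_HOME/bin/nutch readdb crawl/crawldb -stats

# dump the crawldb as text; each record includes its next scheduled "Fetch time",
# which is what generate compares against "now + adddays"
$NUTCH_HOME/bin/nutch readdb crawl/crawldb -dump crawldb_dump
less crawldb_dump/part-00000   # dump files are typically named part-00000, part-00001, ...
```

The "one segment or several segments of a specified size" behaviour of mergesegs corresponds to SegmentMerger's optional -slice argument. A hedged sketch (the slice size of 50000 is illustrative, and the exact option set may vary by Nutch version):

```bash
# what runbot does: merge all current segments into a single merged segment
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

# alternative: slice the merged output into segments of roughly 50000 URLs each
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments -slice 50000 crawl/segments/*
```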
Compare:
nutch distributed indexing (crawler)
See:
Incremental crawling (recrawl) with nutch on Linux -- notes 1
nutch study notes 3 (incremental crawling) -- notes 2