Windows PATH used for this environment:
%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;%SYSTEMROOT%\System32\WindowsPowerShell\v1.0\;D:\Program Files\Microsoft SQL Server\90\Tools\binn\;D:\Java\jdk1.6.0\bin;K:\cygwinnew\bin;D:\Program Files\Adobe\Flex Builder 3\sdks\3.2.0\bin;D:\MinGW\bin;D:\Program Files\Microsoft SQL Server\100\Tools\Binn\;D:\Program Files\Microsoft SQL Server\100\DTS\Binn\;D:\Program Files\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\;D:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\PrivateAssemblies\;D:\Program Files\TortoiseSVN\bin;E:\xpdf\chinese-simplified;E:\xpdf\chinese-simplified\CMap
authorized_keys
/cygdrive/D/Java/jdk1.6.0
/cygdrive/D/tmp/testdata/input
/cygdrive/D/tmp/testoutput
D:\tmp\testoutput
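The same directories appear above in both Windows form (D:\tmp\testoutput) and Cygwin form (/cygdrive/D/tmp/testoutput). A minimal sketch of that mapping (the helper name is my own, not part of Cygwin):

```python
def win_to_cygwin(path: str) -> str:
    """Convert a Windows path like D:\\tmp\\testoutput to /cygdrive/d/tmp/testoutput."""
    drive, _, rest = path.partition(":")          # split off the drive letter
    return "/cygdrive/" + drive.lower() + rest.replace("\\", "/")

print(win_to_cygwin(r"D:\tmp\testoutput"))  # /cygdrive/d/tmp/testoutput
```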
hadoop namenode -format
D:\tmp\testdata\input
Upload the files under input to the input directory in DFS:
$ ./bin/hadoop fs -put D:/tmp/testdata/input input
jar hadoop-0.20.1-examples.jar wordcount input/input output-dir (in earlier releases the jar is named accordingly, e.g. hadoop-0.16.4-examples.jar)
$ ./bin/hadoop jar hadoop-0.20.1-examples.jar wordcount input/input output-dir
11/12/28 17:39:40 INFO input.FileInputFormat: Total input paths to process : 3
11/12/28 17:39:41 INFO mapred.JobClient: Running job: job_201112281720_0003
11/12/28 17:39:42 INFO mapred.JobClient: map 0% reduce 0%
11/12/28 17:39:51 INFO mapred.JobClient: map 66% reduce 0%
11/12/28 17:39:54 INFO mapred.JobClient: map 100% reduce 0%
11/12/28 17:40:03 INFO mapred.JobClient: map 100% reduce 100%
11/12/28 17:40:05 INFO mapred.JobClient: Job complete: job_201112281720_0003
11/12/28 17:40:05 INFO mapred.JobClient: Counters: 17
11/12/28 17:40:05 INFO mapred.JobClient: Job Counters
11/12/28 17:40:05 INFO mapred.JobClient: Launched reduce tasks=1
11/12/28 17:40:05 INFO mapred.JobClient: Launched map tasks=3
11/12/28 17:40:05 INFO mapred.JobClient: Data-local map tasks=3
11/12/28 17:40:05 INFO mapred.JobClient: FileSystemCounters
11/12/28 17:40:05 INFO mapred.JobClient: FILE_BYTES_READ=290
11/12/28 17:40:05 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/28 17:40:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=607
11/12/28 17:40:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=139
11/12/28 17:40:05 INFO mapred.JobClient: Map-Reduce Framework
11/12/28 17:40:05 INFO mapred.JobClient: Reduce input groups=0
11/12/28 17:40:05 INFO mapred.JobClient: Combine output records=16
11/12/28 17:40:05 INFO mapred.JobClient: Map input records=3
11/12/28 17:40:05 INFO mapred.JobClient: Reduce shuffle bytes=221
11/12/28 17:40:05 INFO mapred.JobClient: Reduce output records=0
11/12/28 17:40:05 INFO mapred.JobClient: Spilled Records=32
11/12/28 17:40:05 INFO mapred.JobClient: Map output bytes=284
11/12/28 17:40:05 INFO mapred.JobClient: Combine input records=30
11/12/28 17:40:05 INFO mapred.JobClient: Map output records=30
11/12/28 17:40:05 INFO mapred.JobClient: Reduce input records=16
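The counters above (Map output records=30, Combine input records=30, Combine output records=16, Reduce input records=16) reflect the wordcount pipeline: the map stage emits one record per word, the combiner collapses duplicate words within each map task, and the reducer sums across tasks. A local sketch of that flow (an illustration of the counter semantics, not Hadoop's actual code):

```python
from collections import Counter

def wordcount(files):
    """files: list of files, each a list of text lines (stands in for the 3 input files)."""
    per_task = []
    for lines in files:
        words = [w for line in lines for w in line.split()]  # map: one record per word
        per_task.append(Counter(words))                      # combine: dedupe within one task
    total = Counter()
    for c in per_task:                                       # reduce: sum across tasks
        total.update(c)
    return total

counts = wordcount([["hello hadoop"], ["hello world"], ["hadoop hadoop"]])
print(counts["hadoop"])  # 3
```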
bin/hadoop jar hadoop-0.20.1-index.jar org.apache.hadoop.contrib.index.main.UpdateIndex -inputPaths input/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 2 -numMapTasks 2 -conf conf/index-config.xml
bin/hadoop jar hadoop-0.20.1-index.jar org.apache.hadoop.contrib.index.main.UpdateIndex -inputPaths D:/tmp/testdata/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths D:/tmp/testdata/input -outputPath index_msg_out_010 -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input (a directory in the DFS filesystem) -outputPath index_msg_out_010 (a directory in the DFS filesystem) -indexPath index_030 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
11/12/29 10:05:20 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 10:05:20 INFO main.UpdateIndex: outputPath = index_msg_out_010
11/12/29 10:05:20 INFO main.UpdateIndex: shards = null
11/12/29 10:05:20 INFO main.UpdateIndex: indexPath = index_030
11/12/29 10:05:20 INFO main.UpdateIndex: numShards = 1
11/12/29 10:05:20 INFO main.UpdateIndex: numMapTasks= 1
11/12/29 10:05:20 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 10:05:21 INFO main.UpdateIndex: sea.index.updater = org.apache.hadoop.contrib.index.mapred.IndexUpdater
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/input/input
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/index_msg_out_010
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.map.tasks = 1
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.reduce.tasks = 1
11/12/29 10:05:21 INFO mapred.IndexUpdater: 1 shards = -1@index_030/00000@-1
11/12/29 10:05:21 INFO mapred.IndexUpdater: mapred.input.format.class = org.apache.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 10:05:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/29 10:05:21 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/29 10:05:23 INFO mapred.JobClient: Running job: job_201112281720_0005
11/12/29 10:05:24 INFO mapred.JobClient: map 0% reduce 0%
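The log line `1 shards = -1@index_030/00000@-1` packs each shard as version@directory@generation, where -1 means "unspecified". A small sketch of parsing that format (my own helper, not the contrib index code; the field meanings are an assumption read off the logs):

```python
def parse_shard(spec: str):
    """Parse a shard spec like '-1@index_030/00000@-1' into (version, directory, generation)."""
    version, directory, generation = spec.split("@")  # the directory itself contains no '@'
    return int(version), directory, int(generation)

print(parse_shard("-1@index_030/00000@-1"))  # (-1, 'index_030/00000', -1)
```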
Successful run:
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
11/12/29 10:17:12 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 10:17:12 INFO main.UpdateIndex: outputPath = index_msg_out_012
11/12/29 10:17:12 INFO main.UpdateIndex: shards = null
11/12/29 10:17:12 INFO main.UpdateIndex: indexPath = index_032
11/12/29 10:17:12 INFO main.UpdateIndex: numShards = 1
11/12/29 10:17:12 INFO main.UpdateIndex: numMapTasks= 1
11/12/29 10:17:12 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 10:17:13 INFO main.UpdateIndex: sea.index.updater = org.apache.hadoop.contrib.index.mapred.IndexUpdater
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/input/input
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/index_msg_out_012
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.map.tasks = 1
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.reduce.tasks = 1
11/12/29 10:17:13 INFO mapred.IndexUpdater: 1 shards = -1@index_032/00000@-1
11/12/29 10:17:13 INFO mapred.IndexUpdater: mapred.input.format.class = org.apache.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 10:17:13 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/29 10:17:13 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/29 10:17:13 INFO mapred.JobClient: Running job: job_201112291014_0002
11/12/29 10:17:14 INFO mapred.JobClient: map 0% reduce 0%
11/12/29 10:17:22 INFO mapred.JobClient: map 66% reduce 0%
11/12/29 10:17:25 INFO mapred.JobClient: map 100% reduce 0%
11/12/29 10:17:32 INFO mapred.JobClient: map 100% reduce 22%
11/12/29 10:17:38 INFO mapred.JobClient: map 100% reduce 100%
11/12/29 10:17:40 INFO mapred.JobClient: Job complete: job_201112291014_0002
11/12/29 10:17:40 INFO mapred.JobClient: Counters: 18
11/12/29 10:17:40 INFO mapred.JobClient: Job Counters
11/12/29 10:17:40 INFO mapred.JobClient: Launched reduce tasks=1
11/12/29 10:17:40 INFO mapred.JobClient: Launched map tasks=3
11/12/29 10:17:40 INFO mapred.JobClient: Data-local map tasks=3
11/12/29 10:17:40 INFO mapred.JobClient: FileSystemCounters
11/12/29 10:17:40 INFO mapred.JobClient: FILE_BYTES_READ=2104
11/12/29 10:17:40 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/29 10:17:40 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2910
11/12/29 10:17:40 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=622
11/12/29 10:17:40 INFO mapred.JobClient: Map-Reduce Framework
11/12/29 10:17:40 INFO mapred.JobClient: Reduce input groups=1
11/12/29 10:17:40 INFO mapred.JobClient: Combine output records=3
11/12/29 10:17:40 INFO mapred.JobClient: Map input records=3
11/12/29 10:17:40 INFO mapred.JobClient: Reduce shuffle bytes=892
11/12/29 10:17:40 INFO mapred.JobClient: Reduce output records=1
11/12/29 10:17:40 INFO mapred.JobClient: Spilled Records=6
11/12/29 10:17:40 INFO mapred.JobClient: Map output bytes=1302
11/12/29 10:17:40 INFO mapred.JobClient: Map input bytes=161
11/12/29 10:17:40 INFO mapred.JobClient: Combine input records=3
11/12/29 10:17:40 INFO mapred.JobClient: Map output records=3
11/12/29 10:17:40 INFO mapred.JobClient: Reduce input records=3
11/12/29 10:17:40 INFO main.UpdateIndex: Index update job is done
11/12/29 10:17:40 INFO main.UpdateIndex: Elapsed time is 27s
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$ bin/hadoop fs -copyToLocal /user/kelo-dichan/administrator/*.* D:\tmp\testoutput
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$ bin/hadoop fs -copyToLocal /user/kelo-dichan/administrator/*.* /cygdrive/D/tmp/testoutput
Administrator@kelo-dichan /cygdrive/d/hadoop/run
$
// Fetch all files under the given directory in the HDFS filesystem
$ bin/hadoop fs -get /user/kelo-dichan/administrator/index_032/00000/ D:/tmp/testoutput
$ bin/hadoop fs -get /user/kelo-dichan/administrator/index_035/00000/ D:/tmp/testoutput
Brief summary of distributed indexing
1. The command:
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
Parameter notes
The paths refer to paths in the HDFS distributed filesystem.
-numShards
-numMapTasks
With other values for these two, e.g. both set to 3 (the input path contains 3 files), the resulting index was missing data (fewer documents); the cause is not yet known.
What would the result look like if a combined file were used instead?
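One common way a sharded indexer routes documents is by hashing a document key modulo numShards; if the routing and the number of map tasks or shards disagree, documents can silently land in shards that never get written out, which would look exactly like the missing-document symptom above. A toy sketch of modulo routing (an illustration only, not the contrib index's actual logic):

```python
def route(doc_ids, num_shards):
    """Assign each document id to a shard by a stable hash modulo num_shards."""
    shards = {s: [] for s in range(num_shards)}
    for doc_id in doc_ids:
        # sum of bytes as a deterministic toy hash (Python's str hash is randomized)
        shards[sum(doc_id.encode()) % num_shards].append(doc_id)
    return shards

shards = route(["doc1", "doc2", "doc3", "doc4"], 3)
# routing itself loses nothing; every document ends up in exactly one shard
assert sum(len(v) for v in shards.values()) == 4
```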
$ bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths input/input -outputPath index_msg_out_020 -indexPath index_040 -numShards 3 -numMapTasks 3 -conf conf/index-config.xml
11/12/29 11:39:19 INFO main.UpdateIndex: inputPaths = input/input
11/12/29 11:39:19 INFO main.UpdateIndex: outputPath = index_msg_out_020
11/12/29 11:39:19 INFO main.UpdateIndex: shards = null
11/12/29 11:39:19 INFO main.UpdateIndex: indexPath = index_040
11/12/29 11:39:19 INFO main.UpdateIndex: numShards = 3
11/12/29 11:39:19 INFO main.UpdateIndex: numMapTasks= 3
11/12/29 11:39:19 INFO main.UpdateIndex: confPath = conf/index-config.xml
11/12/29 11:39:20 INFO main.UpdateIndex: sea.index.updater = org.apache.hadoop.contrib.index.mapred.IndexUpdater
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.input.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/input/input
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.output.dir = hdfs://localhost:18888/user/kelo-dichan/administrator/index_msg_out_020
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.map.tasks = 3
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.reduce.tasks = 3
11/12/29 11:39:20 INFO mapred.IndexUpdater: 3 shards = -1@index_040/00000@-1,-1@index_040/00001@-1,-1@index_040/00002@-1
11/12/29 11:39:20 INFO mapred.IndexUpdater: mapred.input.format.class = org.apache.hadoop.contrib.index.example.LineDocInputFormat
11/12/29 11:39:20 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/29 11:39:20 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/29 11:39:20 INFO mapred.JobClient: Running job: job_201112291106_0009
11/12/29 11:39:21 INFO mapred.JobClient: map 0% reduce 0%
11/12/29 11:39:30 INFO mapred.JobClient: map 33% reduce 0%
11/12/29 11:39:34 INFO mapred.JobClient: map 100% reduce 0%
11/12/29 11:39:40 INFO mapred.JobClient: map 100% reduce 7%
11/12/29 11:39:43 INFO mapred.JobClient: map 100% reduce 14%
11/12/29 11:39:46 INFO mapred.JobClient: map 100% reduce 40%
11/12/29 11:39:49 INFO mapred.JobClient: map 100% reduce 66%
11/12/29 11:39:52 INFO mapred.JobClient: map 100% reduce 100%
11/12/29 11:39:54 INFO mapred.JobClient: Job complete: job_201112291106_0009
11/12/29 11:39:54 INFO mapred.JobClient: Counters: 18
11/12/29 11:39:54 INFO mapred.JobClient: Job Counters
11/12/29 11:39:54 INFO mapred.JobClient: Launched reduce tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: Launched map tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: Data-local map tasks=3
11/12/29 11:39:54 INFO mapred.JobClient: FileSystemCounters
11/12/29 11:39:54 INFO mapred.JobClient: FILE_BYTES_READ=2648
11/12/29 11:39:54 INFO mapred.JobClient: HDFS_BYTES_READ=161
11/12/29 11:39:54 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3279
11/12/29 11:39:54 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1025
11/12/29 11:39:54 INFO mapred.JobClient: Map-Reduce Framework
11/12/29 11:39:54 INFO mapred.JobClient: Reduce input groups=2
11/12/29 11:39:54 INFO mapred.JobClient: Combine output records=3
11/12/29 11:39:54 INFO mapred.JobClient: Map input records=3
11/12/29 11:39:54 INFO mapred.JobClient: Reduce shuffle bytes=948
11/12/29 11:39:54 INFO mapred.JobClient: Reduce output records=2
11/12/29 11:39:54 INFO mapred.JobClient: Spilled Records=6
11/12/29 11:39:54 INFO mapred.JobClient: Map output bytes=1350
11/12/29 11:39:54 INFO mapred.JobClient: Map input bytes=161
11/12/29 11:39:54 INFO mapred.JobClient: Combine input records=3
11/12/29 11:39:54 INFO mapred.JobClient: Map output records=3
11/12/29 11:39:54 INFO mapred.JobClient: Reduce input records=3
11/12/29 11:39:54 INFO main.UpdateIndex: Index update job is done
11/12/29 11:39:54 INFO main.UpdateIndex: Elapsed time is 33s
When the filesystem is HDFS (the Hadoop distributed filesystem):
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8888</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem. file:/// hdfs://localhost:8888</description>
</property>
Using the local filesystem also works well; in core-site.xml:
<property>
<name>fs.default.name</name>
<value>file:///</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem. file:/// hdfs://localhost:8888</description>
</property>
Below is an example using the local filesystem. testdata/input is a local relative directory (the full path is D:/hadoop/run/testdata/input, where D:/hadoop/run is the install directory).
jar hadoop-0.20.1-examples.jar wordcount testdata/input output-dir1
bin/hadoop jar hadoop-0.20.1-index.jar -inputPaths testdata/input -outputPath index_msg_out_012 -indexPath index_032 -numShards 1 -numMapTasks 1 -conf conf/index-config.xml
hadoop-0.20.2
./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input/input output-dir
/cygdrive/d/tmp/testdata/input
Using MapReduce in Eclipse
Environment requirements:
eclipse 3.3
hadoop-0.20.2-eclipse-plugin.jar from hadoop 0.20.2
Launch via the debug script (after running the script below, debug the same program in Eclipse using remote debugging; the setup is configurable):
./bin/hddebug jar hadoop-0.20.2-examples.jar wordcount input/input output-dir
Listening for transport dt_socket at address: 28888
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://127.0.0.1:8888/user/kelo-dichan/administrator/input/input
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:67)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Detailed steps
Reference:
1. First configure Hadoop under Win7 so that it works in general.
2. Copy the bin/hadoop script and rename the copy hddebug.
3. Add one line to hddebug, at the line if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then:
HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,address=28888,server=y,suspend=y"
4. Run
./bin/hddebug jar hadoop-0.20.2-examples.jar wordcount input/input output-dir
You should see: Listening for transport dt_socket at address: 28888
5. Start Eclipse and debug the wordcount code: in the Debug menu, set up a remote debug configuration and attach.
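Steps 2 and 3 above amount to splicing the JDWP line into a copy of the launcher script at the marker line. A minimal sketch over an in-memory script text (the splice position follows the steps above; whether the line goes immediately before or after the marker is an assumption, since the original note is ambiguous):

```python
DEBUG_LINE = ('HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,'
              'address=28888,server=y,suspend=y"')
MARKER = 'if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then'

def make_hddebug(script_text: str) -> str:
    """Insert the JDWP debug line just before the marker line (assumes one occurrence)."""
    lines = script_text.splitlines()
    i = lines.index(MARKER)
    return "\n".join(lines[:i] + [DEBUG_LINE] + lines[i:])

sample = 'HADOOP_OPTS=""\n' + MARKER + "\nfi"
patched = make_hddebug(sample)
assert patched.splitlines()[1] == DEBUG_LINE
```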