13.8.15

Run-example from Spark in HDP 2.3.0

[spring-xd@14514aea325a ~]$ run-example SparkPi 10
/usr/bin/run-example: line 24: /usr/bin/load-spark-env.sh: No such file or directory
Failed to find Spark examples assembly in /usr/lib or /usr/examples/target
You need to build Spark before running this program

I need to run it with the full path instead:

$ /usr/hdp/2.3.0.0-2557/spark/bin/run-example SparkPi 10
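To avoid typing the full path every time, putting the HDP Spark bin directory first on the PATH should also work (the version directory is copied from the command above; adjust it for your build):

$ export SPARK_HOME=/usr/hdp/2.3.0.0-2557/spark
$ export PATH=$SPARK_HOME/bin:$PATH
$ run-example SparkPi 10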


11.8.15

A possible Isilon HDFS issue with Hadoop 2.7.1

A simple hadoop fs -cat command may throw the following exception:

java.lang.IllegalStateException: Must not use direct buffers with InputStream API
at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:211)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:201)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:152)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:737)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:793)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:853)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:697)
at java.io.DataInputStream.readShort(DataInputStream.java:312)
at org.apache.hadoop.fs.shell.Display$Text.getInputStream(Display.java:136)
at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:102)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)


Looking into this article:

http://blog.sequenceiq.com/blog/2014/03/07/read-from-hdfs/

I found a possible solution, and it works for me:

        <property>
                <name>dfs.client.use.legacy.blockreader</name>
                <value>true</value>
        </property>
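The property goes into the client-side hdfs-site.xml. For a quick test without touching config files, passing the same setting through the generic -D option that hadoop fs accepts should work too (the file path below is just a placeholder):

$ hadoop fs -Ddfs.client.use.legacy.blockreader=true -cat /tmp/somefile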

7.8.15

Spring XD File Endpoint

src/main/java/org/springframework/integration/file/config
src/main/java/org/springframework/integration/file/event
src/main/java/org/springframework/integration/file/filters
src/main/java/org/springframework/integration/file/locking
src/main/java/org/springframework/integration/file/remote
src/main/java/org/springframework/integration/file/remote/gateway
src/main/java/org/springframework/integration/file/remote/handler
src/main/java/org/springframework/integration/file/remote/session
src/main/java/org/springframework/integration/file/remote/synchronizer
src/main/java/org/springframework/integration/file/splitter
src/main/java/org/springframework/integration/file/support
src/main/java/org/springframework/integration/file/tail
src/main/java/org/springframework/integration/file/transformer


  • config
    • Parser
      • FileInboundChannelAdapterParser
      • FileOutboundChannelAdapterParser
      • FileOutboundGatewayParser
      • FileParserUtils
      • FileSplitterParser
      • FileTailInboundChannelAdapterParser
      • FileToByteArrayTransformerParser
      • FileToStringTransformerParser
      • RemoteFileOutboundChannelAdapterParser

  • event
  • filters
  • locking
  • remote
    • gateway
    • handler
    • session
    • synchronizer
  • splitter
  • support
  • tail
  • transformer


Spring XD


The guide above can serve as a reference for installing an entire cluster. But for now I just want to concentrate on Spring XD configuration and development.


This introduction is somewhat interesting. I read it before, but it didn't leave a deep impression, and by now I've mostly forgotten it. So hands-on practice is still needed, something more down to earth:

• Download with wget
• Install with rpm
• Then start the two services with service: spring-xd-admin and spring-xd-container

Note that a JDK also needs to be installed; see the sketch below.
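A rough sketch of those steps; the download URL, RPM file name, and JDK package name are placeholders, while the two service names are the ones mentioned above:

$ wget http://example.com/spring-xd-1.2.0-1.noarch.rpm    # placeholder URL
$ sudo yum install -y java-1.7.0-openjdk                  # the JDK note above
$ sudo rpm -ivh spring-xd-1.2.0-1.noarch.rpm
$ sudo service spring-xd-admin start
$ sudo service spring-xd-container start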

Once it's installed, you can follow the Quick Start to see how to use it.

Basically you can experiment up to --deploy and --destroy with the ticktock example; after that it's mostly cluster-related material.
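For reference, the ticktock experiment from the Quick Start boils down to these XD shell commands:

xd:> stream create --name ticktock --definition "time | log" --deploy
xd:> stream destroy --name ticktock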

Of course I could jump straight into the cluster setup, but since I'm treating it as a job scheduler, I think it's worth first learning spring-xd tasklet programming.

Spring XD is a unified, distributed, and extensible service for data ingestion, real-time analytics, batch processing, and data export. The foundations of XD's architecture are based on the 100+ man-years of work that have gone into the Spring Batch, Integration and Data projects. Building upon these projects, Spring XD provides servers and a configuration DSL that you can immediately use to start processing data. You do not need to build an application yourself from a collection of jars to start using Spring XD.

• Runtime Architecture
The key components in Spring XD are the XD Admin and XD Container Servers. Using a high-level DSL, you post the description of the required processing task to the Admin server over HTTP. The Admin server then maps the processing tasks into processing modules. A module is a unit of execution and is implemented as a Spring ApplicationContext. A distributed runtime is provided that assigns modules to execute across multiple XD Container servers. A single XD Container server can run multiple modules. When using the single node runtime, all modules are run in a single XD Container and the XD Admin runs in the same process.
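The XD shell does this HTTP POST for you; as a rough illustration against the default admin port 9393 (the exact REST path and parameters here are my assumption, not taken from the text above):

$ curl -d "name=ticktock" -d "definition=time | log" -d "deploy=true" \
      http://localhost:9393/streams/definitions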

• DIRT Runtime
A distributed runtime, called Distributed Integration Runtime, aka DIRT, will distribute the processing tasks across multiple XD Container instances. The XD Admin server breaks up a processing task into individual module definitions and assigns each module to a container instance using ZooKeeper. Each container listens for module definitions to which it has been assigned and deploys the module, creating a Spring ApplicationContext to run it.

[Figure: distributed node]

Modules share data by passing messages using a configured messaging middleware (Rabbit, Redis, or Local for single node). To reduce the number of hops across messaging middleware between them, multiple modules may be composed into larger deployment units that act as a single module.
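Composition is done with the shell's module compose command; a sketch with made-up module names and expressions:

xd:> module compose myprocessor --definition "filter --expression=payload.length()>0 | transform --expression=payload.toUpperCase()"
xd:> stream create --name composed --definition "time | myprocessor | log" --deploy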

Important concepts:

• Container Server Architecture
  • Streams
    • Streams define how event-driven data is collected, processed, and stored or forwarded. For example, a stream might collect syslog data, filter it, and store it in HDFS.
  • Jobs
    • Jobs define how coarse-grained and time-consuming batch processing steps are orchestrated; for example, a job could be defined to coordinate performing HDFS operations and the subsequent execution of multiple MapReduce processing tasks.
  • Taps
    • Taps are used to process data in a non-invasive way as data is being processed by a Stream or a Job. Much like wiretaps used on telephones, a Tap on a Stream lets you consume data at any point along the Stream's processing pipeline. The behavior of the original stream is unaffected by the presence of the Tap (see the sketch after this list).
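A sketch of the syslog example above plus a tap on it (the --port value is made up; syslog-udp, hdfs, and the tap:stream: syntax are standard Spring XD):

xd:> stream create --name syslogToHdfs --definition "syslog-udp --port=5140 | hdfs" --deploy
xd:> stream create --name syslogTap --definition "tap:stream:syslogToHdfs > log" --deploy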

[Figure: taps, jobs, and streams]


The programming model for processing streams in Spring XD is based on the well-known Enterprise Integration Patterns as implemented by components in the Spring Integration project. The programming model was designed so that it is easy to test components.

• Stream Deployment
The Container Server listens for module deployment events initiated from the Admin Server via ZooKeeper. When the container node handles a module deployment event, it connects the module's input and output channels to the data bus used to transport messages during stream processing. In a single node configuration, the data bus uses in-memory direct channels. In a distributed configuration, the data bus communications are backed by the configured transport middleware. Redis and Rabbit are both provided with the Spring XD distribution, but other transports are envisioned for future releases.
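For the single node case the transport can be chosen at startup; a sketch, assuming the --transport option of xd-singlenode (in distributed mode the transport is set in the server configuration instead):

$ xd-singlenode --transport rabbit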

[Figure: anatomyOfAStreamSingleNode]



[Figure: anatomyOfAStreamV3]