24.12.13

http://java.dzone.com/articles/infrastructure-scale-apache
http://java.dzone.com/articles/handling-big-data-hbase-part-5

21.12.13

http://ac31004.blogspot.com/2013/10/installing-hadoop-2-on-mac_29.html
http://apmblog.compuware.com/2013/02/19/speeding-up-a-pighbase-mapreduce-job-by-a-factor-of-15/
http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications
https://labs.ericsson.com/blog/hbase-performance-tuners

20.12.13


  • http://gbif.blogspot.com/2012/07/optimizing-writes-in-hbase.html
  • http://ronxin999.blog.163.com/blog/static/422179202013328105833745/

17.12.13

Install HBase/Cloudera CDH4



  • yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
  • yum install zookeeper
===============================================================================================================================================
 Package                       Arch                    Version                                            Repository                      Size
===============================================================================================================================================
Installing:
 zookeeper                     noarch                  3.4.5+24-1.cdh4.5.0.p0.23.el6                      cloudera-cdh4                  3.7 M
Installing for dependencies:
 bigtop-utils                  noarch                  0.6.0+186-1.cdh4.5.0.p0.23.el6                     cloudera-cdh4                  8.2 k

  • yum install zookeeper-server
Dependencies Resolved

===============================================================================================================================================
 Package                               Arch                  Version                                        Repository                    Size
===============================================================================================================================================
Installing:
 zookeeper-server                      noarch                3.4.5+24-1.cdh4.5.0.p0.23.el6                  cloudera-cdh4                4.9 k
Installing for dependencies:
 foomatic                              x86_64                4.0.4-1.el6_1.1                                rhel-cd                      251 k
 foomatic-db                           noarch                4.0-7.20091126.el6                             rhel-cd                      980 k
 foomatic-db-filesystem                noarch                4.0-7.20091126.el6                             rhel-cd                      4.3 k
 foomatic-db-ppds                      noarch                4.0-7.20091126.el6                             rhel-cd                       19 M
 pax                                   x86_64                3.4-10.1.el6                                   rhel-cd                       69 k
 perl-CGI                              x86_64                3.51-127.el6                                   rhel-cd                      207 k
 perl-Test-Simple                      x86_64                0.92-127.el6                                   rhel-cd                      110 k
 redhat-lsb                            x86_64                4.0-3.el6                                      rhel-cd                       24 k
 redhat-lsb-graphics                   x86_64                4.0-3.el6                                      rhel-cd                       12 k
 redhat-lsb-printing                   x86_64                4.0-3.el6                                      rhel-cd                       11 k

  • service zookeeper-server init
No myid provided, be sure to specify it in /var/lib/zookeeper/myid if using non-standalone
  • service zookeeper-server start
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED

  • yum install hadoop-conf-pseudo
===============================================================================================================================================
 Package                                     Arch                Version                                      Repository                  Size
===============================================================================================================================================
Installing:
 hadoop-conf-pseudo                          x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              8.0 k
Installing for dependencies:
 bigtop-jsvc                                 x86_64              1.0.10-1.cdh4.5.0.p0.23.el6                  cloudera-cdh4               27 k
 hadoop                                      x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4               17 M
 hadoop-hdfs                                 x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4               12 M
 hadoop-hdfs-datanode                        x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              4.8 k
 hadoop-hdfs-namenode                        x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              4.9 k
 hadoop-hdfs-secondarynamenode               x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              4.9 k
 hadoop-mapreduce                            x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              9.9 M
 hadoop-mapreduce-historyserver              x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              4.9 k
 hadoop-yarn                                 x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              8.5 M
 hadoop-yarn-nodemanager                     x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              4.8 k
 hadoop-yarn-resourcemanager                 x86_64              2.0.0+1518-1.cdh4.5.0.p0.24.el6              cloudera-cdh4              4.8 k
 nc                                          x86_64              1.84-22.el6                                  rhel-cd                     57 k
 parquet                                     noarch              1.2.5-1.cdh4.5.0.p0.17.el6                   cloudera-cdh4               13 M
 parquet-format                              noarch              1.0.0-1.cdh4.5.0.p0.20.el6                   cloudera-cdh4              489 k

  • Skipped installing the Thrift server.
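
The steps above can be collected into one small script. This is only a sketch: the `run` helper, the `DRY_RUN` variable and the `install_cdh4_pseudo` function name are made up for illustration, and the commands are exactly the ones noted above. It assumes the CDH4 repository RPM sits in the current directory. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the CDH4 pseudo-distributed install steps from the notes above.
# DRY_RUN, run and install_cdh4_pseudo are hypothetical names, not part of CDH.
# Dry run is the default: each command is printed, not executed.
# Set DRY_RUN=0 and run as root to actually execute the commands.
set -e

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_cdh4_pseudo() {
    # Register the Cloudera CDH4 repository (RPM assumed to be in the current directory).
    run yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
    # ZooKeeper base package, then the server package.
    run yum install zookeeper
    run yum install zookeeper-server
    # Initialise the ZooKeeper data directory, then start the service.
    run service zookeeper-server init
    run service zookeeper-server start
    # Pseudo-distributed Hadoop configuration (pulls in HDFS, YARN and MapReduce).
    run yum install hadoop-conf-pseudo
}

install_cdh4_pseudo
```

Running it as is prints the six commands prefixed with `+`. With `DRY_RUN=0`, yum will still prompt for confirmation, as it did in the session above.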





15.12.13

12.12.13

HBase


  • http://research.google.com/archive/bigtable.html
  • http://blog.cloudera.com/blog/2012/06/hbase-write-path/
  • http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
http://java.dzone.com/articles/how-google-does-code-review

9.12.13

http://www.iteye.com/news/28540-5-linux-shell-commandline-website

4.12.13

What is a good software product?

We all know that we want to build good software products. But what is a good software product? Traditionally, people believe a good software product is one that matches the customer's requirements. So we spend a lot of time collecting requirements from the customer and writing contracts based on them.

But really? How many customers are unhappy with a software product that meets every requirement on paper? When that happens, some would say we didn't understand the customer's requirements well enough.

Please think again. Do customers really know what they want beforehand? They don't. So in XP, people advocate developing software together with a representative from the customer side, so that the customer can give feedback to the development team and help the team eventually build the software the way they really want. That's a good thing, because the customer gradually realizes what they want as the product grows.

However, there are two issues with this kind of development model:

  • It is not the customers' natural duty to help the development team, even though they know they are going to use the system after it is finished and handed over. It is simply not their job.
  • The customer could steer the development in their own direction, making the system difficult to build, while missing many good features that could be built easily.

With those concerns, we have Scrum and a PO (product owner) to work with the development team. That solves the first issue, because the PO is responsible for building the system, but it doesn't necessarily solve the second one, because the PO would still drive the team in the direction they want, not the direction the development team is good at.

You may say that we surely want to build the system the PO or the customers want, not the system the engineers like to build. Really? Have you ever heard that a good engineer is 1000 times more productive than a bad one? If we drag the team in a direction the engineers are not comfortable with, we put the team's productivity at risk. Another issue, and the more real one, is that some things are very difficult to build while others are very easy, and only the people who know the technical details can tell which is which. Having the PO drive the whole development could neglect those differences.

In my mind, software development is very detail-oriented. For example, choosing JMS or Kafka could make a huge difference to the system, in both the architecture and the user experience; a system built on HDFS and MapReduce could also be hugely different from one built on Vertica. That knowledge is far beyond what the customer or the PO could understand, even though we can explain a little. So top-down business requirements could be very dangerous to a development team.

You may ask how we could then derive the real requirements from the market. Then I would ask: what are the real requirements from the market? Before Big Data solutions emerged, did we have those data mining and data analytics requirements? Yes, we did, but only once Big Data was there did those requirements become overwhelmingly important. Why? Only a requirement that can be implemented is a real requirement. Otherwise, we are just chasing millions of things in the world. For example, would it be a good idea to search the Internet for pictures showing exactly the same person as in a given picture? Yes, I am sure millions of users are eager for this feature. But it is not a real requirement for a search engine, because it is not practical for a search engine.

Then, back to the topic. What is a good software product? I would say a good software product is one that the customer is willing to pay for. It may or may not meet the requirements the customer asked for, but it is definitely useful to the customer and better than the products the customer could get from other vendors, so that the customer wants to pay for yours.

Then how could we build such a good product? There are two parts to this definition:
  1. How could we know the product is useful to the customer?
  2. How could we have a solution that is (generally) better than those from others?

For the first question, unfortunately, we can get some hints but cannot know exactly. Those hints come from the customers or from the POs. But we don't know exactly which features would be most useful to the customers, because they don't know either. And as I said above, not everything the customers want can be implemented, while there could be something that is easy to build and useful to the customers that they don't realize they want until they see it.

The best way to find out what is useful to the customer is to have them try something that is possibly useful. If they like it and pay you for it, you've got it. If they don't like it, you change it. It looks simple, but it is actually not easy to make it work.

To be continued...


2.12.13

http://java.dzone.com/articles/scaling-redis-and-rabbitmq