26.2.14

The Trick of Software Development

How many people would claim they know how to build successful software? I doubt there are many, although I am sure there are plenty of people who can help improve parts of a system, identify performance bottlenecks, and reduce memory footprints. But building software from scratch? That is far more difficult.

Why? Simply because of the way software is built. To be more specific, there are two reasons:

First, software is built mentally rather than physically. Many other things are built mentally, such as paintings, poems, and music. But look at those other things: how many are built by a group of people? And how many of them need to be "useful"? Software has to be built by a group of people, and it has to be useful.

Second, software systems sit on top of a rapidly changing industry: the silicon industry. Moore's Law is the destiny of software engineering, and it is where the misery comes from. One day a friend of mine argued with me about why software engineering couldn't work like civil engineering. I asked him to imagine what would happen to the construction industry if its materials became 50% lighter and 50% thinner every 18 months. Then he got my point. And yes, this is where the problems come from: the whole industry of software engineering is essentially dynamic. I really don't think product managers know what our software should be, and I doubt any software architect knows how to build a whole piece of software up front. Basically, software engineering doesn't work top-down. With less experienced engineers at the lower levels, it can't be purely bottom-up either. So how do we build our software? To make things even worse, I don't think our customers know what software they need.

So basically, we are in an industry where no one knows how to build the products, and there is no guaranteed way to build them right. We are screwed from the beginning, like it or not.

The only way to build a software product, in my opinion, is to just build it. Really? Let me talk about that a little more.

I would like to define some kinds of distances for software development: for example, the distance of the test cycle and the distance of the feature development cycle. These two are the most significant ones. Unit tests are so important precisely because they shrink the distance of the test cycle down to sub-second runs. As for the feature development cycle, let's first talk about the "normal" way.

In some teams, if not most teams, we have Product Managers "collect" a list of features for the development teams. Agile or not, the features are pushed down to the development team, and the team tries its best to develop them or to feed back to the Product Managers, or the POs. Most of the time there is some back-and-forth until things settle down, and that is where our features come from.

Working this way, I have seen teams spend half a year on a simple feature that, approached differently, could have been implemented in a few days. Communication between engineers and Product Managers is that expensive, not to mention the other communication costs between engineers.

The first value of the Agile Manifesto is: individuals and interactions over processes and tools.

But what does it mean? Different people have different views. My experience tells me we need to encourage our engineers to help find the right features to build, instead of simply handing features down from the top. The POs, in turn, need to understand what those features are and how they could be useful to the customers, and be prepared to explain to the customers why things don't work the way the customers were imagining.

This is how Google does it, I guess, according to the book "How Google Tests Software". And I truly believe it, because I found that many of the good features in my products came from that kind of short-distance development.

There are many other tricks in software development. The rule of thumb is to be Agile, which I truly believe in. But agility is not just a word; we need to make it real.

The Safe Net and Reusability

When I am designing a software product, I like to identify the risk level of each component. The risk level is judged both from the characteristics of the technologies the component uses and from the customers' point of view. For example, I would rate the risk of an ActiveMQ broker as high, and a database connection as medium or high.

There are two reasons to do that.

First of all, by identifying the risk level of the components, you will place different components in different parts of the system and treat them differently. For example, you don't want to build an embedded ActiveMQ broker into your Central Controller system, because your customers can't afford to have the Central Controller crash; they don't even want to restart it.

Second, and this is my favorite: you want to build a safe net around those high-risk components, so that when bad things happen your customers won't freak out; most of the time they don't even have to know. This is my 80/20 principle: you can't handle every situation, so building a safe net saves you precious time to focus on the business you actually need to focus on.

However, not everyone understands that this is a life-saving trick.

Today I heard a related story. In their product, a safe net was designed to monitor and restart the JMS broker. This safe net tool was built on top of several other home-made, unreliable components, plus a pile of unnecessary business logic. I heard the reason for doing it that way was reusability. Eventually, a defect came along.

I don't really care about the defect itself. The problem is not the defect, but the way a safe net solution should be built.

Look at it this way. Why do we want reusability? Software reusability is very important. We don't want to duplicate our code everywhere, re-implementing similar logic again and again. By avoiding that, we can improve the system by changing just a few places, which is good for debugging, for system evolution, and even for changing the design.

However, the goal of reusability is to make development easier and the software better, not harder. Sometimes I see people try to make things reusable just for the sake of reusability, nothing else. In doing so, they make the system much more complex than it should be, and a lot of unnecessary logical dependencies get baked into it.

It is very difficult to argue about this. One thing I would suggest engineers look at is coupling: when you are building one thing, how many other things do you need to worry about? But even that is arguable.

But for a safe net solution, you must guarantee that it works in the most straightforward way possible. It is what keeps the system alive; don't pile any more risk onto it.

For other parts of the system, I would suggest more book reading and more studying of third-party software. For reusability, please read the GoF's Design Patterns, look at the good and the bad, and truly understand what the patterns are and how to use them. Learn from successful open-source systems and study their architecture and design principles. Don't reinvent the wheel. This is the age of global collaboration; we need to learn and grow with the whole world.



19.2.14

Quick Performance Test for Vertica on IEEE Floating-Point Compression/Decompression

I hadn't thought about it this way before, but today I ran some tests and finally hit the performance issue. Surprisingly, it is not about I/O, but about CPU.

I am planning to run a standard deviation calculation over some 4 million data items. Each data item has 50 columns and 4,032 samples, so the calculation basically has to handle 200 million (item, column) series, each of which has about 4K samples.

I started my experiment with this setup and found that it took 39 minutes to finish the calculation. What? This calculation is supposed to run every hour, and I will probably end up handling 100 million data items, and possibly 50K samples.
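For reference, the query was shaped roughly like the sketch below. The table and column names (measurements, item_id, c1 through c50) are hypothetical stand-ins, assuming one row per sample with an item id and 50 measurement columns; the real schema may differ.

-- Per-item standard deviation over all 50 measurement columns
-- (4 million items x 50 columns => ~200 million result values).
SELECT item_id,
       STDDEV(c1)  AS sd_c1,
       STDDEV(c2)  AS sd_c2,
       -- ... and so on for c3 through c49 ...
       STDDEV(c50) AS sd_c50
FROM measurements
GROUP BY item_id;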

I suspected this was because Vertica had a problem and was not using enough memory to cache the intermediate results. I have more than 100 GB of RAM on each Vertica node, but it only used 12 GB for the calculation. However, after doing some simple math, I found that 12 GB was all it needed.

To verify this, I did the following two tests.

Test 2a: do the calculation on only one column. It took 3 minutes.
Test 2b: do the calculation on all columns but without distinguishing the data items, which means the result is 50 double values instead of 200 million. It took 27 minutes to finish. (Both variants are sketched below.)
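Using the same hypothetical table as above (again, the names are stand-ins, not the real schema), the two variants look roughly like this:

-- Test 2a: one column, still grouped per item (~4 million result rows).
SELECT item_id, STDDEV(c1) AS sd_c1
FROM measurements
GROUP BY item_id;

-- Test 2b: all 50 columns, no grouping (exactly 50 result values).
SELECT STDDEV(c1), STDDEV(c2), /* ... c3 through c49 ... */ STDDEV(c50)
FROM measurements;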

Obviously, this was not about memory usage but about traversing the data. When I watched the Linux top command while the query was running, it clearly showed a heavy CPU workload, pushing the system load up to 30 or even 40 at times. My system has 24 CPU cores, so a system load of 30 means the cores are fully loaded.



16.2.14

2PC and 2PL


2PL

2PC
PAXOS

Understanding Epochs in Vertica



An epoch is associated with each COMMIT: the current_epoch at the time of the COMMIT is the epoch for that load. Vertica supports historical queries, though it's not a common use case for most customers. You can only query epochs that are after the current AHM, which is kept aggressively current by default.

AHM (Ancient History Mark): deleted data prior to the AHM is eligible to be purged when a mergeout or an explicit purge happens. After it's purged, the delete vectors no longer need to be maintained.
LGE (Last Good Epoch): the epoch at which all data has been written from WOS to ROS. Any data after the LGE will be lost if the cluster shuts down abnormally, for example from a power loss or a set of exceptions across multiple nodes.
Refresh Epoch: don't worry about it; it doesn't get referenced in practice.
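A quick way to check these marks on a running cluster is to query the SYSTEM table or the corresponding meta-functions. The column and function names below are from memory, so verify them against the documentation for your Vertica version.

-- Current epoch, AHM, and Last Good Epoch in one query.
SELECT current_epoch, ahm_epoch, last_good_epoch FROM system;

-- Meta-function equivalents for the AHM and the LGE.
SELECT GET_AHM_EPOCH();
SELECT GET_LAST_GOOD_EPOCH();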


dbadmin=> select current_epoch from system;
current_epoch 
---------------
44
(1 row)

dbadmin=> insert into epochs values(1); commit;
OUTPUT 
--------
1
(1 row)

COMMIT
dbadmin=> select current_epoch from system;
current_epoch 
---------------
45
(1 row)

dbadmin=> insert into epochs values(2); commit;
OUTPUT 
--------
1
(1 row)

COMMIT
dbadmin=> select current_epoch from system;
current_epoch 
---------------
46
(1 row)

dbadmin=> select * from epochs;

---
1
2
(2 rows)

dbadmin=> at epoch 45 select * from epochs;

---
1
2
(2 rows)

dbadmin=> at epoch 44 select * from epochs;

---
1
(1 row)


dbadmin=> select make_ahm_now();
make_ahm_now 
-----------------------------
AHM set (New AHM Epoch: 46)
(1 row)

dbadmin=> at epoch 45 select * from epochs;
ERROR 2318: Can't run historical queries at epochs prior to the Ancient History Mark
(Sharon Cutter, Independent Consultant for Vertica)