Sunday, December 8, 2013
No Shared Resource
I've been pretty busy on one of our big data projects, caught up in the production move and related activities.
One of the key lessons learned is not to use network-attached storage like NAS/SAN. It is highly recommended to use local storage, because of the key factor 'data locality'. You can understand this concept by understanding how MapReduce and HDFS work in the Big Data world.
As you know, HDFS is a distributed file system. Think of it as RAID spread across the network. What if you had a 10GB file and 10 servers, and each disk could burst 1GBps? Assume your Ethernet is fast enough (say 10GbE) not to be the bottleneck. If you read the file from one server at 1GBps, it would take 10 seconds. What if you could read all ten 1GB pieces at once? In essence, that is what HDFS allows: an aggregate burst from your cluster that's bigger than the burst you could get from any individual node.
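To see data locality concretely, here is a minimal sketch using the standard Hadoop FileSystem API to ask the NameNode which DataNodes hold each block of a file. The file path passed in is just an illustrative argument; the code assumes fs.defaultFS in your core-site.xml points at the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster configured in core-site.xml (fs.defaultFS).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path(args[0]);   // e.g. an HDFS path you want to inspect
        FileStatus status = fs.getFileStatus(file);

        // One entry per HDFS block, with the DataNodes that hold a replica.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
    }
}

Each block reports the hosts where it physically lives, which is exactly what the MapReduce scheduler uses to run a task on the node that already has the data. On NAS/SAN there is no such mapping: every read travels over the network to the same shared array.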
As another point, MapReduce was built by Google around the divide-and-conquer concept. The given problem is broken into small independent chunks that are sent to each node in the map phase; the answers, computed in parallel, are sent back and combined in the reduce phase. If your application adds network hops and latency, with multiple nodes contending for the same storage, you lose the high-performance edge you chose Hadoop for by running it on a shared platform.
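As a concrete illustration of that divide-and-conquer flow, here is a minimal sketch of the classic word-count job against the standard Hadoop MapReduce API: each map task processes its own input split (ideally the one stored on its local disk), and the reducer combines the partial counts into the final answer.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each node works on its own chunk of the input, independently.
    public static class TokenMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the parallel partial answers are combined into the result.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice that the mappers never coordinate with each other; that independence is what lets the work scale out. Put the input on a shared NAS/SAN and all those "independent" mappers end up queuing on one storage array.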
As the last point, the application has no priority control over a shared resource like NAS/SAN by default; every other workload on the same array competes for the same I/O. So the app misses its performance benchmarks even though it runs on the MapReduce model.
The conclusion: do not build a high-performance, massively parallel MapReduce system on shared storage.