Saturday, March 8, 2014

Apache Spark

Apache Spark, an in-memory data-processing framework, is now a top-level Apache project, an important step for Spark’s stability as it increasingly replaces MapReduce in next-generation big data applications. Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.

Spark is an open-source cluster computing framework for data analytics, originally developed in the AMPLab at UC Berkeley. It fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms, which typically make many passes over the same data. Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data.
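To make the "load once, query repeatedly" idea concrete, here is a minimal single-machine sketch in plain Python. It is not the Spark API: the ToyDataset class, its methods, and the sample log lines are all hypothetical and exist only for illustration. The sketch shows transformations that are recorded lazily and a cache() call that keeps the loaded data in memory, so the second query does not go back to the source.

```python
# Toy illustration only: this is NOT the Spark API. Every name here is hypothetical.
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """A lazily evaluated dataset that can keep its contents in memory."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # recipe for (re)building the data
        self._cached: Optional[List] = None  # materialized data, once cached
        self._want_cache = False

    def cache(self) -> "ToyDataset":
        # Ask for the data to be kept in memory the first time it is computed.
        self._want_cache = True
        return self

    def filter(self, pred: Callable) -> "ToyDataset":
        # Lazy: nothing runs until an action such as count() is called.
        return ToyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def map(self, fn: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (fn(x) for x in self._iterate()))

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)        # served from memory
        data = self._compute()
        if self._want_cache:
            self._cached = list(data)
            return iter(self._cached)
        return data

    def count(self) -> int:
        # Action: triggers the actual computation.
        return sum(1 for _ in self._iterate())


reads = {"count": 0}  # how many times the "slow" source was read


def read_source():
    # Stand-in for an expensive read from distributed storage.
    reads["count"] += 1
    return iter(["INFO start", "ERROR disk full", "INFO ok", "ERROR timeout"])


lines = ToyDataset(read_source).cache()
errors = lines.filter(lambda s: s.startswith("ERROR"))

n_errors = errors.count()                                     # first query loads the data
n_timeouts = errors.filter(lambda s: "timeout" in s).count()  # second query reuses memory
```

Without the cache() call, each query would read the source again. Avoiding that repeated read is the saving an iterative algorithm gets from keeping its working set in memory.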

Apache Spark™ is a fast and general engine for large-scale data processing. The Apache Software Foundation announced recently that Spark has graduated from the Apache Incubator to become a top-level Apache project, signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles. This is a major step for the community and we are very proud to share this news with users as we complete Spark’s move to Apache.
