Important points on Apache Spark
Spark Basics:
- Well suited for iterative programs (e.g. ML algorithms)
- Spark's Scala API (a functional programming language) gives us: immutability, lazy transformations (evaluation is deferred until an action needs the result) and type inference. Because of immutability, data can be safely cached & distributed. An RDD is one big distributed collection.
- An RDD is a big data collection with these properties: immutable, distributed, lazily evaluated, type inferred, resilient (fault-tolerant) & cacheable.
- Spark remembers all its transformations; a transformation by itself does not execute anything. If every transformation were materialized eagerly, we would end up with multiple copies of the data (bad!).
- Scala code runs on top of the JVM.
- Spark-shell is interactive.
- val means immutable, but an immutable value can be mapped to a new value (see the first sketch after this list).
- Interactive Queries
- Real-time & batch processing are unified
- Good use of resources: multi-core CPUs, network bandwidth & disk
- Velocity is as important as volume.
- Real-time processing is as important as batch processing
- Existing MapReduce is tightly coupled with its API.
- Spark makes use of Hadoop distributed storage.
- In-memory processing
- Unified big-data processing platform => talks to Hadoop, talks to any NoSQL store, runs on Mesos/YARN, supports any kind of processing.
- Good for distributors, good for developers, good for users.
- All the different processing platforms share the same abstraction, i.e. the RDD.
- RDDs can be used by ML programs.
- The first version of Spark was just 1600 lines of code; a small, simple & modular program.
- In-memory processing uses the DataNodes' main memory; the cached data is fault tolerant; shuffling of data can also be done in-memory.
- Multi-language APIs (Java/Python/Scala) & we can run SQL using Spark SQL (see the Spark SQL sketch after this list).
- Partition => logical division of data used for processing; applies to distributed processing of input data, intermediate data & output data; partitions are the basic units of parallelism. An RDD is a collection of partitions. Spark uses the Hadoop partitioning API for dividing data; by default one Hadoop block maps to one Spark partition, but this ratio can be changed (see the partitions sketch after this list).
- Partitions are immutable by default: the SparkContext reads them through the HDFS API, and HDFS blocks are immutable because you can't change file content via those APIs.
- An RDD is resilient because the underlying data is also fault tolerant.
- RDD.partitions => Gives all the partitions of the RDD.
- Accessing partitions: mapPartitions (gives an iterator over the partition) / mapPartitionsWithIndex. Where would you want to access whole partitions? => reducing operations, sequential operations, or matrix operations on a partition.
- Find the min & max of a partition => read the whole partition as an iterator (see the mapPartitionsWithIndex sketch after this list).
- The output partitions of a transformation need not be the same as the input partitions, e.g. hash partitioning (groupByKey/reduceByKey). Which partition a record lands in depends on its key; the programmer can provide the number of partitions required after the shuffle. Repartitioning creates a new RDD, and lookups are faster on a hash-partitioned RDD (see the partitioning sketch after this list).
- A transformation records its parent RDDs; before computing its values an RDD asks its parent(s) first. This dependency management is what allows laziness. It's like a chain: each RDD has to remember its transformation, and the RDD subclass remembers the operation (see the lineage sketch after this list).
- Transformations are lazy; later an action invokes the runJob API, which is told which RDD to compute & what you want to compute on it (see the laziness sketch after this list).
- Caching => rdd.persist(StorageLevel.MEMORY_ONLY) or MEMORY_AND_DISK; the SparkContext keeps track of all RDDs; cached blocks live in the BlockManager (see the caching sketch after this list).
- BlockManager => stores cache data, shuffle data & broadcast data. There is one BlockManager per slave. Cached blocks are keyed by RDD ID & partition index in the BlockManager.
- Spark architecture: the Spark driver program; the Spark scheduler => takes the transformations & runs them on the distributed machines; cluster management (e.g. Mesos/YARN).
- Challenge for Spark: multiple copies of the data across transformations. Solution: Spark remembers all the transformations, so it can apply multiple transformations at once (see the laziness sketch below).
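
The sketches below illustrate some of the points above. They are rough examples, not the definitive API usage, and unless noted they assume the `sc` SparkContext that spark-shell provides. First, the val point: a `val` cannot be reassigned, but it can always be mapped to a new value.

```scala
val nums = List(1, 2, 3)          // val: the reference cannot be reassigned
// nums = List(4, 5, 6)           // would not compile: reassignment to val
val doubled = nums.map(_ * 2)     // mapping yields a new, independent value
println(doubled)                  // List(2, 4, 6)
```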
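Partitions and per-partition access (the RDD.partitions / mapPartitionsWithIndex bullets): a small sketch; the input data and the partition count of 4 are made up for illustration.

```scala
// Ask for 4 partitions explicitly; for HDFS input the default follows the block split.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
println(rdd.partitions.length)               // 4 -> all partitions of the RDD

// mapPartitionsWithIndex hands each partition to the function as an iterator,
// so per-partition work such as min & max is a single sequential pass.
val minMax = rdd.mapPartitionsWithIndex { (idx, it) =>
  val values = it.toList                     // materialise this partition only (non-empty here)
  Iterator((idx, values.min, values.max))
}
minMax.collect().foreach(println)            // one (index, min, max) tuple per partition
```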
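Controlling partitioning after a shuffle (the hash-partitioning bullet): a sketch with a made-up pair RDD; the numPartitions argument of reduceByKey and an explicit HashPartitioner are the usual knobs.

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// The programmer chooses how many partitions exist after the shuffle.
val summed = pairs.reduceByKey(_ + _, numPartitions = 2)

// Repartitioning by key creates a new RDD; lookups on it are faster because
// only the partition owning the key has to be scanned.
val byKey = pairs.partitionBy(new HashPartitioner(2))
println(byKey.partitioner)      // Some(HashPartitioner)
println(byKey.lookup("a"))      // values for key "a"
```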
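Lineage / parent RDDs (the dependency-management bullet): each RDD remembers the transformation that produced it and its parent RDDs, and `toDebugString` prints that chain. `input.txt` is just a placeholder path.

```scala
val words  = sc.textFile("input.txt").flatMap(_.split(" "))   // hypothetical input file
val counts = words.map((_, 1)).reduceByKey(_ + _)

// Nothing has run yet; the RDD only holds its dependency chain (lineage),
// which is also what makes recomputation after a failure possible.
println(counts.toDebugString)
```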
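Laziness and actions (the runJob bullet, and the "apply multiple transformations at once" point): transformations only record work; the action triggers the job, and the chained steps run in one pass per partition. The file name is a placeholder.

```scala
// Only recorded, not executed:
val cleaned = sc.textFile("events.log")
  .filter(_.nonEmpty)
  .map(_.toLowerCase)

// The action invokes runJob under the hood: it says which RDD to compute and what to
// compute on it; filter and map run together in a single pass over each partition,
// so no intermediate copy of the data is materialised.
val n = cleaned.count()
println(n)
```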
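Caching (the persist/StorageLevel/BlockManager bullets): a sketch of persisting an RDD either in memory only or spilling to disk; cached partitions are served by each executor's BlockManager. `access.log` is a placeholder.

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("access.log")

logs.persist(StorageLevel.MEMORY_ONLY)          // keep partitions in memory only
// logs.persist(StorageLevel.MEMORY_AND_DISK)   // or spill to disk when memory is full

logs.count()   // first action materialises the RDD and caches its partitions
logs.count()   // reuses the cached blocks (keyed by RDD ID & partition index)
```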
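SQL on the same platform (the Spark SQL bullet): a sketch using a SparkSession; the file `people.json` and its name/age columns are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

val people = spark.read.json("people.json")     // hypothetical input with name & age fields
people.createOrReplaceTempView("people")

// The same data can be queried with plain SQL.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```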