Important points on Apache Spark
Spark Basics:
- Well suited for iterative programs (e.g. ML algorithms)
- Spark's Scala API (a functional programming language) gives us: immutability, lazy transformations (evaluation is deferred until an action needs the result) and type inference. Because of immutability, data can be safely cached & distributed. An RDD is one big distributed collection.
- An RDD is a big data collection with these properties: immutable, distributed, lazily evaluated, type inferred, resilient (fault-tolerant) & cacheable.
- Spark remembers all its transformations; a transformation by itself does not execute anything. If every transformation were materialized eagerly, we would end up with multiple copies of the data (bad!).
- Scala code runs on top of the JVM.
- Spark-shell is interactive.
- val means immutable, but an immutable value can be mapped to a new value (see the first sketch after this list).
- Interactive Queries
- Real-time & batch processing are unified
- Good use of resources: multi-core CPUs, network bandwidth & disk
- Velocity is as important as volume.
- Real-time processing is as important as batch processing
- Existing MapReduce is tightly coupled with its API.
- Spark makes use of Hadoop distributed storage.
- In-memory processing
- Unified big-data processing platform => talks to Hadoop, talks to any NoSQL store, runs on Mesos/YARN, supports any kind of processing.
- Good for distributors, good for developers, good for users.
- All the different processing platforms share the same abstraction, i.e. the RDD.
- RDDs can be used by ML programs.
- The first version of Spark was just 1600 lines of code; a small, simple & modular program.
- In-memory processing uses the DataNodes' main memory; the cached data is fault tolerant; shuffling of data can also be done in-memory.
- Multi-language APIs (Java/Python/Scala) & we can run SQL using Spark SQL (see the Spark SQL sketch after this list).
- Partition => logical division of data used for processing; applies to distributed processing of input data, intermediate data & output data; partitions are the basic units of parallelism. An RDD is a collection of partitions. Spark uses the Hadoop partitioning API for dividing data; by default one Hadoop block maps to one Spark partition, but this ratio can be changed (see the partitions sketch after this list).
- Partitions are immutable by default: the SparkContext reads them through the HDFS API, and HDFS blocks are immutable because you can't change file content via those APIs.
- An RDD is resilient because the underlying data is also fault tolerant.
- RDD.partitions => Gives all the partitions of the RDD.
- Accessing partitions: mapPartitions (gives an iterator over the partition) / mapPartitionsWithIndex. Where would you want to access whole partitions? => reducing operations, sequential operations, or matrix operations on a partition.
- Find the min & max of a partition => read the whole partition as an iterator (see the mapPartitionsWithIndex sketch after this list).
- The output partitions of a transformation need not be the same as the input partitions, e.g. hash partitioning (groupByKey/reduceByKey). Which partition a record lands in depends on its key; the programmer can provide the number of partitions required after the shuffle. Repartitioning creates a new RDD, and lookups are faster on a hash-partitioned RDD (see the partitioning sketch after this list).
- A transformation records its parent RDDs; before computing its values an RDD asks its parent(s) first. This dependency management is what allows laziness. It's like a chain: each RDD has to remember its transformation, and the RDD subclass remembers the operation (see the lineage sketch after this list).
- Transformations are lazy; later an action invokes the runJob API, which is told which RDD to compute & what you want to compute on it (see the laziness sketch after this list).
- Caching => rdd.persist(StorageLevel.MEMORY_ONLY) or MEMORY_AND_DISK; the SparkContext keeps track of all RDDs; cached blocks live in the BlockManager (see the caching sketch after this list).
- BlockManager => stores cache data, shuffle data & broadcast data. There is one BlockManager per slave. Cached blocks are keyed by RDD ID & partition index in the BlockManager.
- Spark architecture: the Spark driver program; the Spark scheduler => takes the transformations & runs them on the distributed machines; cluster management (e.g. Mesos/YARN).
- Challenge for Spark: multiple copies of the data across transformations. Solution: Spark remembers all the transformations, so it can apply multiple transformations at once (see the laziness sketch below).
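
The sketches below illustrate some of the points above. They are rough examples, not the definitive API usage, and unless noted they assume the `sc` SparkContext that spark-shell provides. First, the val point: a `val` cannot be reassigned, but it can always be mapped to a new value.

```scala
val nums = List(1, 2, 3)          // val: the reference cannot be reassigned
// nums = List(4, 5, 6)           // would not compile: reassignment to val
val doubled = nums.map(_ * 2)     // mapping yields a new, independent value
println(doubled)                  // List(2, 4, 6)
```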
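Partitions and per-partition access (the RDD.partitions / mapPartitionsWithIndex bullets): a small sketch; the input data and the partition count of 4 are made up for illustration.

```scala
// Ask for 4 partitions explicitly; for HDFS input the default follows the block split.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
println(rdd.partitions.length)               // 4 -> all partitions of the RDD

// mapPartitionsWithIndex hands each partition to the function as an iterator,
// so per-partition work such as min & max is a single sequential pass.
val minMax = rdd.mapPartitionsWithIndex { (idx, it) =>
  val values = it.toList                     // materialise this partition only (non-empty here)
  Iterator((idx, values.min, values.max))
}
minMax.collect().foreach(println)            // one (index, min, max) tuple per partition
```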
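Controlling partitioning after a shuffle (the hash-partitioning bullet): a sketch with a made-up pair RDD; the numPartitions argument of reduceByKey and an explicit HashPartitioner are the usual knobs.

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// The programmer chooses how many partitions exist after the shuffle.
val summed = pairs.reduceByKey(_ + _, numPartitions = 2)

// Repartitioning by key creates a new RDD; lookups on it are faster because
// only the partition owning the key has to be scanned.
val byKey = pairs.partitionBy(new HashPartitioner(2))
println(byKey.partitioner)      // Some(HashPartitioner)
println(byKey.lookup("a"))      // values for key "a"
```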
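Lineage / parent RDDs (the dependency-management bullet): each RDD remembers the transformation that produced it and its parent RDDs, and `toDebugString` prints that chain. `input.txt` is just a placeholder path.

```scala
val words  = sc.textFile("input.txt").flatMap(_.split(" "))   // hypothetical input file
val counts = words.map((_, 1)).reduceByKey(_ + _)

// Nothing has run yet; the RDD only holds its dependency chain (lineage),
// which is also what makes recomputation after a failure possible.
println(counts.toDebugString)
```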
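Laziness and actions (the runJob bullet, and the "apply multiple transformations at once" point): transformations only record work; the action triggers the job, and the chained steps run in one pass per partition. The file name is a placeholder.

```scala
// Only recorded, not executed:
val cleaned = sc.textFile("events.log")
  .filter(_.nonEmpty)
  .map(_.toLowerCase)

// The action invokes runJob under the hood: it says which RDD to compute and what to
// compute on it; filter and map run together in a single pass over each partition,
// so no intermediate copy of the data is materialised.
val n = cleaned.count()
println(n)
```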
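Caching (the persist/StorageLevel/BlockManager bullets): a sketch of persisting an RDD either in memory only or spilling to disk; cached partitions are served by each executor's BlockManager. `access.log` is a placeholder.

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("access.log")

logs.persist(StorageLevel.MEMORY_ONLY)          // keep partitions in memory only
// logs.persist(StorageLevel.MEMORY_AND_DISK)   // or spill to disk when memory is full

logs.count()   // first action materialises the RDD and caches its partitions
logs.count()   // reuses the cached blocks (keyed by RDD ID & partition index)
```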
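SQL on the same platform (the Spark SQL bullet): a sketch using a SparkSession; the file `people.json` and its name/age columns are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

val people = spark.read.json("people.json")     // hypothetical input with name & age fields
people.createOrReplaceTempView("people")

// The same data can be queried with plain SQL.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```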