Important points on Apache Spark

Spark Basics:

  1. Well suited to iterative programs (e.g. ML algorithms), since intermediate results can stay in memory across iterations.
  2. Spark's Scala (a functional programming language) brings: immutability, lazy transformations (evaluation is deferred until a result is actually needed), and type inference. Because of immutability, we can cache & distribute. An RDD is a big collection of data.
  3. An RDD (Resilient Distributed Dataset) is a big collection of data with these properties: immutable, distributed & lazily evaluated, type-inferred, resilient (fault-tolerant) & cacheable.
  4. Spark remembers all of its transformations; a transformation by itself doesn't trigger any action. Applied naively, each transformation would leave behind its own copy of the data (bad!); point 35 below explains how Spark avoids this.
  5. Scala code runs on top of the JVM.
  6. The spark-shell is interactive.
  7. val means immutable, var means mutable; an immutable value can still be mapped to a new value (see the val/var sketch after this list).
  8. Interactive Queries
  9. Real-time & batch processing unified
  10. Good use of resources: multi-core CPUs, network bandwidth & disk
  11. Velocity is as important as volume.
  12. Real-time processing is as important as batch processing.
  13. Existing MapReduce is tightly coupled to its API.
  14. Spark makes use of Hadoop's distributed storage (HDFS).
  15. In-memory processing
  16. Unified big-data processing platform => talks to Hadoop, talks to any NoSQL store, talks to Mesos/YARN, supports any kind of processing.
  17. Good for distributors, good for developers, good for users.
  18. All the different processing platforms share the same abstraction, i.e. the RDD.
  19. RDDs can be used by ML programs.
  20. The first version of Spark was just 1600 lines of code; a small, simple & modular program.
  21. In-memory processing uses the data nodes' main memory; the cached data is fault tolerant; shuffling of data can also be done in-memory.
  22. Multi-language APIs (Java/Python/Scala), & we can do SQL using Spark SQL.
  23. Partition => a logical division of data, used only for processing; the basis of distributed processing (input data, intermediate data, output data) and the basic unit of parallelism. An RDD is a collection of partitions. Spark uses the Hadoop partitioning API to divide input data: one Hadoop block = one Spark partition by default, but this ratio can be changed (see the partition sketch after this list).
  24. Partitions are immutable by default: the Spark context reads through the HDFS API, and HDFS blocks are immutable because you can't change file content via those APIs.
  25. An RDD is resilient because the data underneath it is also fault tolerant.
  26. rdd.partitions => gives all the partitions of the RDD.
  27. Accessing a partition: mapPartitions (hands you an iterator over the partition) / mapPartitionsWithIndex. Use these where you want to work on a whole partition at once => reducing operations, sequential operations, or matrix operations on partitions.
  28. To find the min & max of a partition => read the whole partition as an iterator (see the mapPartitions sketch after this list).
  29. The partitions of a transformation's output need not match the input partitions; hash partitioning happens in operations such as groupByKey/reduceByKey. Which partition a record lands in depends on its key, and the programmer can specify the number of partitions required after shuffling. Re-partitioning creates a new RDD; lookups are faster after hash partitioning (see the partitioning sketch after this list).
  30. Each transformation records its parent RDD(s); before computing its own values, an RDD asks its parents for theirs. This dependency management is what allows laziness: it's like a chain (the lineage), and each subclass of RDD remembers the operation that produced it (see the lineage sketch after this list).
  31. Transformations are lazy; later, an action invokes the runJob API, which takes which RDD to compute & what you want to compute on it.
  32. Caching => rdd.persist(StorageLevel.MEMORY_ONLY) or MEMORY_AND_DISK ("disk also"); the Spark context keeps track of all RDDs; storage is handled by the BlockManager (see the caching sketch after this list).
  33. BlockManager => holds cache data, shuffle data & broadcast data. There is one BlockManager per slave. A cached block is addressed by RDD ID & partition index as the key.
  34. Spark architecture: the Spark driver program, and the Spark scheduler => takes the transformations & runs them on the distributed machines; cluster management is pluggable (custom managers).
  35. A challenge for Spark: naively, each transformation would produce its own copy of the data. Solution: Spark remembers all the transformations, so it can apply multiple transformations at once.
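
A minimal val/var sketch (plain Scala; runnable in the spark-shell or any Scala REPL):

```scala
val xs = List(1, 2, 3)    // val: the reference is immutable
// xs = List(4, 5, 6)     // would not compile: reassignment to val
var n = 1                 // var: the reference is mutable
n = 2                     // fine

// An immutable value can still be mapped to a new value:
val ys = xs.map(_ * 10)   // builds a NEW list; xs is unchanged
println(xs)               // List(1, 2, 3)
println(ys)               // List(10, 20, 30)
```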
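
A partition-count sketch, assuming the spark-shell (where `sc`, the SparkContext, is predefined) and a hypothetical HDFS path:

```scala
// One partition per HDFS block by default (hypothetical input path):
val rdd = sc.textFile("hdfs:///data/input.txt")
println(rdd.partitions.length)   // rdd.partitions gives all partitions of the RDD

// The block-to-partition ratio can be changed by asking for more partitions:
val rdd8 = sc.textFile("hdfs:///data/input.txt", minPartitions = 8)
println(rdd8.partitions.length)
```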
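
A mapPartitions-style sketch, again assuming the spark-shell's predefined `sc`; it reads each partition as an iterator and emits that partition's min & max:

```scala
val nums = sc.parallelize(1 to 100, numSlices = 4)

val minMax = nums.mapPartitionsWithIndex { (idx, it) =>
  if (it.isEmpty) Iterator.empty
  else {
    val v = it.toArray               // materialize the whole partition
    Iterator((idx, v.min, v.max))    // one result per partition
  }
}
minMax.collect().foreach(println)    // e.g. (0,1,25), (1,26,50), ...
```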
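
A partitioning sketch, assuming the spark-shell's `sc`; it shows choosing the number of post-shuffle partitions, and explicit hash partitioning for faster key lookups:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// The programmer can specify how many partitions exist after the shuffle:
val counts = pairs.reduceByKey(_ + _, numPartitions = 2)
println(counts.partitions.length)    // 2

// Explicit hash partitioning: the key's hash decides the partition.
val hashed = pairs.partitionBy(new HashPartitioner(4))
println(hashed.partitioner)          // Some(HashPartitioner)
println(hashed.lookup("a"))          // probes only the matching partition
```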
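
A lineage sketch, assuming the spark-shell's `sc`; the transformations only record their parents, and the action at the end triggers the actual job:

```scala
val base    = sc.parallelize(1 to 10)
val doubled = base.map(_ * 2)            // nothing runs yet: just records its parent
val evens   = doubled.filter(_ % 4 == 0) // still nothing runs

// Each RDD remembers its parents, forming a chain (the lineage):
println(evens.toDebugString)

// Only an action executes the chain; internally it calls SparkContext.runJob
// with the RDD to compute and the function to apply:
println(evens.count())
```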
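
A caching sketch, assuming the spark-shell's `sc`:

```scala
import org.apache.spark.storage.StorageLevel

val logs   = sc.parallelize(Seq("INFO a", "ERROR b", "INFO c"))
val errors = logs.filter(_.startsWith("ERROR"))

errors.persist(StorageLevel.MEMORY_ONLY)   // or StorageLevel.MEMORY_AND_DISK ("disk also")
// errors.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)

println(errors.count())   // first action computes and caches the partitions
println(errors.count())   // second action reads the cached blocks via the BlockManager
```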
