Posts

Showing posts from 2018

Automatically Create timelions/visualizations and dashboards in Kibana-6.0+ with Python

While using Metricbeat with Elasticsearch and Kibana for performance-metrics analysis, it is really tedious to create visualisations and dashboards in Kibana by hand. It is much nicer to automate this with Python through the Kibana REST APIs. Here is my rough automation in Python: https://github.com/Indu-sharma/timelion-dashboard-kibana-python
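For illustration only, here is a minimal sketch (not the repository's code) of pushing a visualization to Kibana via its saved-objects REST API. The endpoint, object id and visState payload are assumptions and vary by Kibana version (the saved-objects HTTP API only appeared in later 6.x releases; older 6.0 setups wrote to the .kibana index directly).

import json
import requests

KIBANA = "http://localhost:5601"          # assumed Kibana URL
HEADERS = {"kbn-xsrf": "true", "Content-Type": "application/json"}

# Hypothetical, simplified visualization definition; real visState payloads are larger.
vis_state = {
    "title": "cpu-usage",
    "type": "timelion",
    "params": {"expression": ".es(metric=avg:system.cpu.user.pct)", "interval": "auto"},
}

body = {
    "attributes": {
        "title": vis_state["title"],
        "visState": json.dumps(vis_state),
        "uiStateJSON": "{}",
        "kibanaSavedObjectMeta": {"searchSourceJSON": "{}"},
    }
}

resp = requests.post(
    "{}/api/saved_objects/visualization/cpu-usage".format(KIBANA),
    headers=HEADERS,
    data=json.dumps(body),
)
print(resp.status_code, resp.text)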

PySpark : Cheat-sheet

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf

Performance monitoring Tools for Big data

Performance monitoring and sizing at scale in the big data ecosystem is a real challenge. Here are a few tools to use:
1. Metricbeat: Run Metricbeat on each of the cluster nodes and visualise the stats using Elasticsearch/Kibana. https://www.elastic.co/guide/en/beats/metricbeat/current/index.html It has modules for many components such as Docker, Kubernetes, KVM, Elasticsearch, Kafka, Logstash and more.
2. Dr. Elephant: Mainly for performance monitoring and tuning of Hadoop clusters and Spark jobs: https://github.com/linkedin/dr-elephant
3. ElasticHQ / Rally: Monitor Elasticsearch indexing and query performance at scale: http://www.elastichq.org/index.html Rally for sizing ES: https://www.elastic.co/blog/announcing-rally-benchmarking-for-elasticsearch
4. Sparklens from Qubole: For profiling and sizing of Spark jobs alone, Sparklens from Qubole is a good choice too: https://github.com/qubole/sparklens
5. Linux OS tools: You ca

ElasticSearch: Cheat-sheet

# Elasticsearch Cheatsheet - an overview of commonly used Elasticsearch API commands

# cat paths
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}

# Important Things
bin/elasticsearch                                                # Start Elastic instance
curl -X GET  'http://localhost:9200/?pretty=true'                # View instance metadata
curl -X POST 'http://localhost:9200/_shutdown'                   # Shutdown Elastic instance
curl -X GET 'http://localhost:9200/_cat?pretty=true'             # List all admin methods
curl -X GET 'http://localhost:9200/_cat/indices?pretty=true'

Elasticsearch : How to do Performance Testing?

Step-1: First perform indexing performance testing by ingesting data with the following settings applied:
"refresh_interval": "-1"
"number_of_replicas": 0
"merge.scheduler.max_thread_count": 1
"translog.flush_threshold_size": "1024mb"
"translog.durability": "async"
"thread_pool.bulk.queue_size": 1000
"bootstrap.memory_lock": true
Step-2: Scale the data, the Elasticsearch nodes and the JVM heap memory, ingest data and measure indexing performance.
At T1: curl -XGET 'http://localhost:9200/_stats/indexing?pretty=true' | grep -Ei 'index_total|index_time_in_millis'
At T2: curl -XGET 'http://localhost:9200/_stats/indexing?pretty=true' | grep -Ei 'index_total|index_time_in_millis'
Indexing rate = 1000 * (index_total(at T2) - index_total(at T1)) / (index_time_in_millis(at T2) - index_time_in_millis(at T1))
Step-3: Use benchmarking tools such as Rally https://esrall
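As a rough illustration (my own sketch, not from the original post), the same measurement can be scripted: sample _stats/indexing twice and apply the formula above. The URL and the sleep interval are assumptions.

import time
import requests

ES = "http://localhost:9200"          # assumed Elasticsearch URL

def indexing_totals():
    # Cluster-wide totals from the indexing stats API
    stats = requests.get(ES + "/_stats/indexing").json()
    idx = stats["_all"]["total"]["indexing"]
    return idx["index_total"], idx["index_time_in_millis"]

t1_total, t1_millis = indexing_totals()
time.sleep(60)                        # measurement window
t2_total, t2_millis = indexing_totals()

# Docs indexed per second of active indexing time (the formula used above)
rate = 1000.0 * (t2_total - t1_total) / (t2_millis - t1_millis)
print("Indexing rate: %.2f docs/sec of indexing time" % rate)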

Security Compliance and policies

A. Organisation Level
1. Service Organisation Control (SOC-1/2, Type-I/II): https://www.netgainit.com/soc-2-type-ii-certification-defined/
2. General Data Protection Regulation (GDPR) requirements: https://www.csoonline.com/article/3202771/data-protection/general-data-protection-regulation-gdpr-requirements-deadlines-and-facts.html
3. HIPAA (Health Insurance Portability and Accountability Act): https://searchhealthit.techtarget.com/definition/HIPAA
4. NIST (National Institute of Standards and Technology): https://digitalguardian.com/blog/what-nist-compliance
5. STAR (Security, Trust & Assurance Registry): https://cloudsecurityalliance.org/star/#_overview
6. CSA (Cloud Security Alliance): https://www.cloudsecurityalliance.org/csaguide.pdf
7. PCI (Payment Card Industry): https://www.pcisecuritystandards.org/
8. SOX (Sarbanes-Oxley Act): https://www.blackstratus.com/sox-compliance-requirements/
9. ISO 27001 ISMS: http://www.iso27001security.com/html/toolkit.html

Exploit-DB

https://www.exploit-db.com/exploits/34272/

How do you substitute the Nth match of a given regular expression pattern in Python?

I needed to replace the Nth IP address in logs with a given IP address. This is how I achieved it in a Pythonic way; it may be useful to you as well.

import re

mystr = '203.23.48.0 DENIED 302 449 800 1.1 302 http d.flashresultats.fr 10.111.103.202 GET GET - 188.92.40.78 '
src = '1.1.1.1'
replace_nth = lambda mystr, pattern, sub, n: re.sub(re.findall(pattern, mystr)[n - 1], sub, mystr)
mystr = replace_nth(mystr, r'\S*\d+\.\d+\.\d+\.\d+\S*', src, 2)
print(mystr)
# Output: 203.23.48.0 DENIED 302 449 800 1.1 302 http d.flashresultats.fr 1.1.1.1 GET GET - 188.92.40.78
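Note that the findall-then-sub trick re-scans the string and replaces every occurrence of the Nth match's text, so it can misfire when that text appears more than once. A safer variant (my own sketch, not from the original post) counts matches inside a re.sub callback:

import re

def replace_nth(text, pattern, sub, n):
    # Replace only the n-th match of `pattern` in `text` with `sub`.
    counter = {"i": 0}
    def _repl(match):
        counter["i"] += 1
        return sub if counter["i"] == n else match.group(0)
    return re.sub(pattern, _repl, text)

mystr = '203.23.48.0 DENIED 302 449 800 1.1 302 http d.flashresultats.fr 10.111.103.202 GET GET - 188.92.40.78 '
print(replace_nth(mystr, r'\d+\.\d+\.\d+\.\d+', '1.1.1.1', 2))
# 203.23.48.0 DENIED 302 449 800 1.1 302 http d.flashresultats.fr 1.1.1.1 GET GET - 188.92.40.78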

Python : All About Enumeration Object

from enum import IntEnum, Enum, unique
import sys

sys.stdout = open(__name__, 'w')

class CountryCode1(Enum):
    Nepal = 977
    India = 91
    Pakistan = 92
    Bangladesh = 90
    Bhutan = 93

class CountryCode2(IntEnum):
    Nepal = 977
    India = 91
    Pakistan = 92
    Bangladesh = 90
    Bhutan = 93

def test_countryCode_1():
    for i in CountryCode1:
        print("With enum Inherited from Enum => Country : {}, Code : {}".format(i.name, i.value))

def test_countryCode_2():
    try:
        for i in sorted(CountryCode1):
            print("With enum Inherited from Enum & Trying to Sort Enum Object => Country : {}, Code : {}".format(i.name, i.value))
    except Exception as e:
        print("With enum Inherited from Enum & Trying to Sort Enum Object: {}".format(e))

de

Python: All about generators

Generators are very powerful in Python. They are created by using 'yield' inside a function. Here are the three most important variants, where 'yield' turns a function into a generator, a coroutine or a context manager: [Will explain each when I get time :)]

A. Using 'yield' as a generator in a function:

def mygen(a=0, b=1):
    while True:
        yield b
        a, b = b, a + b

c = mygen()
for i in range(10):
    print(c.next())

B. Using 'yield' as a coroutine:

def mycoroutine():
    v_count = 0
    inv_count = 0
    try:
        while True:
            myinput = yield
            if isinstance(myinput, int):
                v_count = v_count + 1
            else:
                inv_count = inv_count + 1
    except GeneratorExit:
        print("You Sent {} valid digits and {} invalid chars:".format(v_count, inv_count))

mygen = mycoroutine()
mygen.next()
for i in range(
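The third variant (yield as a context manager) is cut off above. As a minimal sketch of the idea (my own example, not the post's), contextlib.contextmanager turns a single-yield generator into a with-statement context manager:

from contextlib import contextmanager

@contextmanager
def opened(path):
    # Code before the yield runs on entering the with-block,
    # code after it runs on exit (even if an exception occurs).
    f = open(path)
    try:
        yield f
    finally:
        f.close()

with opened("/etc/hostname") as f:       # any readable file path works here
    print(f.read())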

Python: Method Resolution Order(MRO) in Inheritance

Let's consider the following Python code:

class A:
    def do(self):
        print("From the class: A")

class B(A):
    def do(self):
        print("From the class: B")

class C(A):
    def do(self):
        print("From the class: C")

class D(C):
    pass

class E(D, B):
    pass

class F(E):
    pass

What would be the output of the following?

inst = F()
inst.do()

To understand this, we should know the method resolution order of Python's inherited classes. In Python 2.x (old-style classes), the resolution algorithm is depth-first, then left to right. In the above example the order is F -> E -> D -> C -> A -> B -> A, so the lookup finds do() in C and the output is: From the class: C. In Python 3.x, the resolution follows C3 linearization, which honours left-to-right base order while never visiting a class before its subclasses. Here that yields F -> E -> D -> C -> B -> A (you can verify it with the snippet below), so the first class providing do() is again C and the output is: From the class: C. For details, refer to: https://en.wikipedia.org/wiki/C3_linearization
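A quick way to verify the order (a small illustrative snippet, assuming the classes above are already defined) is to ask Python for the MRO directly:

# Python 3: print the C3 linearization that attribute lookup follows
print([cls.__name__ for cls in F.__mro__])
# ['F', 'E', 'D', 'C', 'B', 'A', 'object']

# The same information via the inspect module
import inspect
print(inspect.getmro(F))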

Python : Memory Management and debugging

Memory Management in Python: important points.
- Dynamic allocation (on heap memory) -> objects and values; stack memory -> variables (references) and method/function call frames.
- Everything in Python is an object. Python always uses the heap to store ints, string values etc., unlike C.
- Python maintains reference counts when multiple variables point to the same object; for weak references the refcount is not incremented. Once the reference count (see sys.getrefcount) reaches zero, Python reclaims the object immediately.
- Refcount updates are not thread safe, which is the main reason the GIL comes into play.
- In a class we can use __slots__ = ('x', 'y') to prevent instances from growing new attributes.
- In complex data structures such as doubly linked lists, deques and trees, cyclical references can keep the refcount from ever reaching zero. In that case Python periodically runs a mark-and-sweep style cycle collector (as Java uses) by marking the only objec
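As a tiny illustration (my own sketch) of two of these points, reference counts and __slots__:

import sys

class Point(object):
    __slots__ = ('x', 'y')        # instances cannot grow new attributes

p = Point()
p.x, p.y = 1, 2
try:
    p.z = 3                       # not listed in __slots__
except AttributeError as e:
    print("blocked by __slots__:", e)

data = [1, 2, 3]
alias = data                      # a second strong reference to the same list
print(sys.getrefcount(data))      # count includes the temporary reference made by the call itself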

Linux OS : Boot Process (BIOS, MBR, GRUB, Kernel, Init, Runlevel)

What happens during the Linux boot process? Read here: https://www.thegeekstuff.com/2011/02/linux-boot-process What happens when a non-privileged user logs in after a reboot?
1. The init process (pid=1, uid=0) spawns the login process (pid=x, uid=0) via the fork() and exec() system calls.
2. The login process calls fork(), sets the uid for the user with setuid() (pid=x, uid=0), and then exec()s the user's shell.
3. The shell (uid > 0, pid=x) executes ~/.bashrc, exports all the environment variables, etc.
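A minimal illustrative sketch (my own, not the actual login(1) implementation) of the fork + exec pattern described above; the setuid step is left commented out because it requires root:

import os

pid = os.fork()                    # parent keeps running; the child sees pid == 0
if pid == 0:
    # A real login process would drop privileges first, e.g. os.setuid(1000)
    os.execvp("bash", ["bash", "-c", "id; echo HOME=$HOME"])   # replace the child with a shell
else:
    os.waitpid(pid, 0)             # init/login waits for the session to end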

Exception Handling :: Python

Python has powerful exception-handling capabilities. A few exceptions derive directly from BaseException, while most derive from Exception, which is itself a child of BaseException. All user-defined exceptions have to derive from Exception or one of its children. Let's see SystemExit, a direct child of BaseException, in action:

import sys
try:
    sys.exit()
except Exception as e:
    pass
print(e)

In the above case the program exits even though we try to catch Exception. That is because SystemExit derives from BaseException, not Exception. However, in the following case the exception is caught as expected:

import sys
try:
    sys.exit()
except BaseException as e:
    pass
print(e)   # e is the caught SystemExit instance

Let's see a few awful things that can go wrong with try/except:

try:
    raise "Value Error"
except "Value Error":
    pass
print("UnCaught!")
# TypeError: exce

Behind the scene: Memory Address allocation and Copy/DeepCopy in Python with List

Let's say we have a Python list a = [1, 1, 2]. Ever wondered how memory allocation happens for the list elements? In Python, variables are tags (references) to objects, meaning the two elements with value 1 are stored at the same location and 2 is stored at a different location. When we print the memory location of a, i.e. id(a), and the memory location of each element, we get:

(4315182216, 140602806130168, 1)
(4315182216, 140602806130168, 1)
(4315182216, 140602806130144, 2)

As seen above, the memory address of a is 4315182216, and the memory address of 1 is 140602806130168 for both the first and second elements. To understand the impact of this kind of allocation, let's look at the various ways of copying a list and check how modifying the original list variable or its elements affects the copy.

::Without using Copy Module::: i.e. when b = a
----------------------------------------Memory addresses of List a, before modification on a:------
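The excerpt cuts off before the comparison. As a minimal sketch of the idea (my own example, not the post's exact code), here is how plain assignment, copy.copy and copy.deepcopy differ for a nested list:

import copy

a = [[1, 2], [3, 4]]
b = a                    # same object, no copy at all
c = copy.copy(a)         # shallow copy: new outer list, shared inner lists
d = copy.deepcopy(a)     # deep copy: inner lists duplicated too

a[0][0] = 99             # mutate an inner element of the original
a.append([5, 6])         # mutate the outer list

print(b)   # [[99, 2], [3, 4], [5, 6]]  -> sees both changes
print(c)   # [[99, 2], [3, 4]]          -> sees the inner change only
print(d)   # [[1, 2], [3, 4]]           -> unaffected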

Serialisation and DeSerialisation of python Objects

Serialisation and deserialisation, a.k.a. SerDe, are important aspects of streaming data from objects into a file and from the file back into objects. There are many ways to do this in Python; a few of the tools are PyYAML, JSON and cPickle. cPickle is the fastest, as it can SerDe almost any Python object (strings/text, lists, dicts, class objects) in byte format. One thing to worry about with SerDe tools such as cPickle is that security is a great concern if we store object information as a byte stream in a file. Hence, cPickle should be used carefully, with the highest possible protocol. The following example demonstrates SerDe with cPickle for Python class objects.

import cPickle

class acc:
    def __init__(self, id, bal):
        self.id = id
        self.bal = bal
    def dep(self, amount):
        self.bal += amount
    def withdraw(self, amount):
        self.bal -= amount

ac = acc('1988999', 200)
ac.dep(1000)
ac.withdraw(500)
fd = open("ty2"
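For the "highest possible protocol" point, a small sketch of my own (Python 2 style to match the post; in Python 3 the module is simply pickle):

import cPickle

account = {'id': '1988999', 'bal': 700}

# Dump with the highest protocol available (binary, faster and more compact)
with open('account.pkl', 'wb') as fd:
    cPickle.dump(account, fd, cPickle.HIGHEST_PROTOCOL)

# Only unpickle data you trust -- loading can execute arbitrary reconstruction logic
with open('account.pkl', 'rb') as fd:
    print(cPickle.load(fd))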

How can you invoke all the methods of a class at one go without explicitly calling each?

# Let's define a class with some methods:

class Nums:
    def __init__(self, n):
        self.val = n
    def multwo(self):
        self.val *= 2
    def addtwo(self):
        self.val += 2
    def devidetwo(self):
        self.val /= 2

# Let's create an instance of the class:
f = Nums(2)

Now, call each and every method listed by dir(f). You should, however, take care of methods taking arguments; for example, __init__ needs args, so we skip it.

for m in dir(f):
    mymethod = getattr(f, m)
    if callable(mymethod):
        if 'init' not in str(mymethod):
            mymethod()
    else:
        pass
print(f.val)

=> It results in 4. Wonder why? To figure it out, let's see the order in which the methods within the class are invoked. This is how the instance stores its methods and properties:

['__doc__', '__init__', '__module__', 'addtwo', 'devidetwo', 'multwo', 'val']

So, val = (2+2)/

Preventing and enforcing the override in derived classes in Python

In Python there is nothing like true data hiding. However, we can represent it by convention:
_ (single underscore) marks a method or attribute as internal by convention.
__ (double underscore) triggers name mangling, so the method or attribute effectively can't be overridden from derived classes; redefining it in a subclass creates a differently mangled name, and trying to access the base's version directly raises an exception.
@abstractmethod decorator on a method means it must be overridden in the derived class; not implementing that method causes an exception when the class is instantiated.
__xyz__ (double underscore prefix and suffix) means the method/attribute is a built-in ("dunder") hook and can be overridden.

Example:

from abc import ABC, abstractmethod

class Base(ABC):
    def __init__(self):
        self.result = self._private()
    def _private(self):
        return "Yes"
    @abstractmethod
    def you_should_override(self):
        pass
    def __you_cant_override(self):
        pass

class Derived(Base):
    def you_should_override(self):
        pass

Data Models in Python Using LinkedList example

Python data models are an important aspect of designing objects: they let you implement how your objects should behave for the user. For example, by default print(object) prints nothing useful, but we can make it produce something meaningful. This experiment implements a few data-model (dunder) methods for a LinkedList object in Python, such as __iter__ (makes the LinkedList object iterable), __eq__ (compares two LinkedLists with the == operator), __add__ (adds the elements of two LinkedLists), __str__ (prints the elements of the LinkedList when the object is printed) and __repr__ (the representation of the LinkedList object returned by repr()). For implementation details, please visit the Git link: https://github.com/Indu-sharma/Utilities/tree/master/Python_generic/PythonDataModels
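As a minimal illustration of the idea (my own simplified sketch, not the code in the repository):

class LinkedList(object):
    class Node(object):
        def __init__(self, value, nxt=None):
            self.value, self.nxt = value, nxt

    def __init__(self, *values):
        self.head = None
        for v in reversed(values):
            self.head = self.Node(v, self.head)

    def __iter__(self):                      # makes the object iterable
        node = self.head
        while node:
            yield node.value
            node = node.nxt

    def __eq__(self, other):                 # enables == between two lists
        return list(self) == list(other)

    def __str__(self):                       # what print(obj) shows
        return " -> ".join(str(v) for v in self)

print(LinkedList(1, 2, 3))                   # 1 -> 2 -> 3
print(LinkedList(1, 2) == LinkedList(1, 2))  # True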

Malicious & Good URLs List

Malicious URLs:
textspeier.de/ 104.27.163.228
photoscape.ch/Setup.exe 31.148.219.11
sarahdaniella.com/swift/SWIFT%20$.pdf.ace 63.247.140.224
amazon-sicherheit.kunden-ueberpruefung.xyz 185.61.138.74
alegroup.info/ntnrrhst 194.87.217.87
fourthgate.org/Yryzvt 104.200.67.194
dieutribenhkhop.com/parking/ 84.200.4.125
dieutribenhkhop.com/parking/pay/rd.php?id=10 84.200.4.125
ssl-6582datamanager.de/ 54.72.9.51
privatkunden.datapipe9271.com/ 104.31.75.147
www.hjaoopoa.top/admin.php?f=1.gif 52.207.234.89
up.mykings.pw:8888/update.txt 60.250.76.52
down.mykings.pw:8888/ver.txt 60.250.76.52
down.mykings.pw:8888/ups.rar 60.250.76.52
fo5.a1-downloader.org/g2v9s1.php?id=yourname@yourdomain.com 188.225.32.177
falconsafe.com.sg/api/get.php?id=aW5mb0BzYXBjdXBncmFkZXMuY29t 43.229.84.107
www.lifelabs.vn/api/get.php?id=aW5mb0BzYXBjdXBncmFkZXMuY29t 118.69.196.199
61kx.uk-insolvencydirect.com/sending_data/in_cgi/bbwp/cases/Inquiry.php 35.166.113.223
daralasnan.com/wp-content/pl

DNS Malicious traffic simulation on Kali Linux

1> Spoof DNS Traffic : https://packetstormsecurity.com/files/10080/ADMid-pkg.tgz.html https://null-byte.wonderhowto.com/how-to/hack-like-pro-spoof-dns-lan-redirect-traffic-your-fake-website-0151620/

Using PySpark to demonstrate Spark transformations and actions on RDDs and stages/DAG evaluation - Part 3 (How to avoid shuffle joins in PySpark)

In this part, let's demonstrate Spark transformations and actions on multiple RDDs using PySpark. Consider two datasets:

dataset1=["10.1.0.1,1000","10.1.0.2,2000","10.1.0.3,3000"]*1000
dataset2=["10.1.0.1,hostname1","10.1.0.2,hostname2","10.1.0.3,hostname3"]

I want the final output in the following form, aggregating the bytes in the first dataset in case of duplicate records:

10.1.0.1,hostname1,100000

Approach-1: First aggregate dataset1 by IP address and save it as an RDD, then create an RDD from the second dataset and join the two RDDs.

dataset1_rdd = sc.parallelize(dataset1).map(lambda x: x.split(",")).mapValues(lambda x: int(x)).reduceByKey(lambda x, y: x + y)
dataset2_rdd = sc.parallelize(dataset2).map(lambda x: tuple(x.split(",")))
for i in dataset1_rdd.join(dataset2_rdd).map(lambda x: str(x[0]) + "," + str(x[1][1]) + "," + str(x[1][0])).collect():
    print i

10.1.0.1,hostname1,1000
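The excerpt is cut off before the shuffle-free approach promised in the title. As a minimal sketch of the usual alternative (my own illustration, not necessarily the post's exact code), broadcast the small dataset and do a map-side join so the join itself needs no shuffle:

# Approach-2 (sketch): broadcast the small lookup dataset, join inside map()
small = dict(tuple(x.split(",")) for x in dataset2)        # {ip: hostname}
lookup = sc.broadcast(small)

result = (sc.parallelize(dataset1)
            .map(lambda x: x.split(","))
            .map(lambda kv: (kv[0], int(kv[1])))
            .reduceByKey(lambda x, y: x + y)               # the aggregation still shuffles
            .map(lambda kv: "%s,%s,%d" % (kv[0], lookup.value.get(kv[0], "unknown"), kv[1])))

for line in result.collect():
    print line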

Using PySpark to demonstrate Spark transformations and actions on RDDs and stages/DAG evaluation - Part 2

In this part, we use unstructured text to demonstrate Spark transformations and actions with PySpark.

file:///data/text.csv
myrdd = sc.textFile("file:///data/text.csv")
myrdd.take(2)
[u'Think of it for a moment \u2013 1 Qunitillion = 1 Million Billion! Can you imagine how many drives / CDs / Blue-ray DVDs would be required to store them? It is difficult to imagine this scale of data generation even as a data science professional. While this pace of data generation is very exciting,  it has created entirely new set of challenges and has forced us to find new ways to handle Big Huge data effectively.', u'']

Q1: Convert all words in the RDD to lowercase and split the lines of the document on spaces.

myrdd.map(lambda x: x.lower().split()).take(2)
[[u'think', u'of', u'it', u'for', u'a', u'moment', u'\u2013', u'1', u'qunitillion', u'=', u'1', u'milli

Using PySpark to demonstrate Spark transformations and actions on RDDs and stages/DAG evaluation - Part 1

Throughout the experiment we will use the following connections information:

/usr/bin/pyspark
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.INFO)

file: connections.csv
srcIp,dstIp,srcPort,dstPort,protocol,bytes
1.1.1.1,10.10.10.10,11111,22,tcp,1000
1.1.1.1,10.10.10.10,22222,69,udp,2000
2.2.2.2,20.20.20.20,33333,21,tcp,3000
2.2.2.2,30.30.30.30,44444,69,udp,4000
3.3.3.3,30.30.30.30,44444,22,tcp,5000
4.4.4.4,40.40.40.40,55555,25,tcp,6000
5.5.5.5,50.50.50.50,66666,161,udp,7000
6.6.6.6,60.60.60.60,77777,162,tcp,8000

Q1. Find the sum of bytes sent by each srcIp.

headers = sc.textFile("file:///data/connections.csv").first()
filtered_rdd = sc.textFile("file:///data/connections.csv").filter(lambda x: x != headers and x.strip())
# Transformations -> filter; number of stages -> 1
for i in filtered_rdd.map(lambda x: (x.split(",")[0], int(x.split(",")[-1]))).reduceByKey(lambda x
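The excerpt cuts off mid-expression; assuming the obvious intent (sum of bytes per srcIp), a complete version of Q1 would look roughly like this (my own sketch):

headers = sc.textFile("file:///data/connections.csv").first()
filtered_rdd = sc.textFile("file:///data/connections.csv").filter(lambda x: x != headers and x.strip())

# (srcIp, bytes) pairs, summed per srcIp; reduceByKey adds one shuffle stage
bytes_per_src = (filtered_rdd
                   .map(lambda x: (x.split(",")[0], int(x.split(",")[-1])))
                   .reduceByKey(lambda x, y: x + y))

for ip, total in bytes_per_src.collect():
    print ip, total        # e.g. 1.1.1.1 3000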