Posts

Showing posts from 2018

Automatically Create timelions/visualizations and dashboards in Kibana-6.0+ with Python

While using Metricbeat with Elasticsearch and Kibana for performance-metrics analysis, it is really tedious to create visualisations and dashboards in Kibana by hand. It is much nicer to automate this with Python through the Kibana REST APIs. Here is my rough automation in Python: https://github.com/Indu-sharma/timelion-dashboard-kibana-python
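For illustration only, here is a minimal sketch (not the repository's code) of pushing a visualization to Kibana via its saved-objects REST API. The endpoint, object id and visState payload are assumptions and vary by Kibana version (the saved-objects HTTP API only appeared in later 6.x releases; older 6.0 setups wrote to the .kibana index directly).

import json
import requests

KIBANA = "http://localhost:5601"          # assumed Kibana URL
HEADERS = {"kbn-xsrf": "true", "Content-Type": "application/json"}

# Hypothetical, simplified visualization definition; real visState payloads are larger.
vis_state = {
    "title": "cpu-usage",
    "type": "timelion",
    "params": {"expression": ".es(metric=avg:system.cpu.user.pct)", "interval": "auto"},
}

body = {
    "attributes": {
        "title": vis_state["title"],
        "visState": json.dumps(vis_state),
        "uiStateJSON": "{}",
        "kibanaSavedObjectMeta": {"searchSourceJSON": "{}"},
    }
}

resp = requests.post(
    "{}/api/saved_objects/visualization/cpu-usage".format(KIBANA),
    headers=HEADERS,
    data=json.dumps(body),
)
print(resp.status_code, resp.text)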

PySpark : Cheat-sheet

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf

Performance monitoring Tools for Big data

Performance monitoring and sizing at scale in the big data ecosystem is a real challenge. Here are a few tools to use:
1. Metricbeat: Run Metricbeat on each of the cluster nodes and visualise the stats using Elasticsearch/Kibana. https://www.elastic.co/guide/en/beats/metricbeat/current/index.html It has modules for many components such as Docker, Kubernetes, KVM, Elasticsearch, Kafka, Logstash and more.
2. Dr. Elephant: Mainly for performance monitoring and tuning of Hadoop clusters and Spark jobs: https://github.com/linkedin/dr-elephant
3. ElasticHQ / Rally: Monitor Elasticsearch indexing and query performance at scale: http://www.elastichq.org/index.html Rally for sizing ES: https://www.elastic.co/blog/announcing-rally-benchmarking-for-elasticsearch
4. Sparklens from Qubole: For profiling and sizing of Spark jobs alone, Sparklens from Qubole is a good choice too: https://github.com/qubole/sparklens
5. Linux OS tools: You ca

ElasticSearch: Cheat-sheet

# Elasticsearch Cheatsheet - an overview of commonly used Elasticsearch API commands

# cat paths
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}

# Important Things
bin/elasticsearch                                                # Start Elastic instance
curl -X GET  'http://localhost:9200/?pretty=true'                # View instance metadata
curl -X POST 'http://localhost:9200/_shutdown'                   # Shutdown Elastic instance
curl -X GET 'http://localhost:9200/_cat?pretty=true'             # List all admin methods
curl -X GET 'http://localhost:9200/_cat/indices?pretty=true'

Elasticsearch : How to do Performance Testing?

Step-1: First perform indexing performance testing by ingesting data with the following settings applied:
"refresh_interval": "-1"
"number_of_replicas": 0
"merge.scheduler.max_thread_count": 1
"translog.flush_threshold_size": "1024mb"
"translog.durability": "async"
"thread_pool.bulk.queue_size": 1000
"bootstrap.memory_lock": true
Step-2: Scale the data, the Elasticsearch nodes and the JVM heap memory, ingest data and measure indexing performance.
At T1: curl -XGET 'http://localhost:9200/_stats/indexing?pretty=true' | grep -Ei 'index_total|index_time_in_millis'
At T2: curl -XGET 'http://localhost:9200/_stats/indexing?pretty=true' | grep -Ei 'index_total|index_time_in_millis'
Indexing rate = 1000 * (index_total(at T2) - index_total(at T1)) / (index_time_in_millis(at T2) - index_time_in_millis(at T1))
Step-3: Use benchmarking tools such as Rally https://esrall
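As a rough illustration (my own sketch, not from the original post), the same measurement can be scripted: sample _stats/indexing twice and apply the formula above. The URL and the sleep interval are assumptions.

import time
import requests

ES = "http://localhost:9200"          # assumed Elasticsearch URL

def indexing_totals():
    # Cluster-wide totals from the indexing stats API
    stats = requests.get(ES + "/_stats/indexing").json()
    idx = stats["_all"]["total"]["indexing"]
    return idx["index_total"], idx["index_time_in_millis"]

t1_total, t1_millis = indexing_totals()
time.sleep(60)                        # measurement window
t2_total, t2_millis = indexing_totals()

# Docs indexed per second of active indexing time (the formula used above)
rate = 1000.0 * (t2_total - t1_total) / (t2_millis - t1_millis)
print("Indexing rate: %.2f docs/sec of indexing time" % rate)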

Security Compliance and policies

A. Organisation Level
1. Service Organisation Control (SOC-1/2, Type-I/II): https://www.netgainit.com/soc-2-type-ii-certification-defined/
2. General Data Protection Regulation (GDPR) requirements: https://www.csoonline.com/article/3202771/data-protection/general-data-protection-regulation-gdpr-requirements-deadlines-and-facts.html
3. HIPAA (Health Insurance Portability and Accountability Act): https://searchhealthit.techtarget.com/definition/HIPAA
4. NIST (National Institute of Standards and Technology): https://digitalguardian.com/blog/what-nist-compliance
5. STAR (Security, Trust & Assurance Registry): https://cloudsecurityalliance.org/star/#_overview
6. CSA (Cloud Security Alliance): https://www.cloudsecurityalliance.org/csaguide.pdf
7. PCI (Payment Card Industry): https://www.pcisecuritystandards.org/
8. SOX (Sarbanes-Oxley Act): https://www.blackstratus.com/sox-compliance-requirements/
9. ISO 27001 ISMS: http://www.iso27001security.com/html/toolkit.html

Exploit-DB

https://www.exploit-db.com/exploits/34272/

How do you substitute the Nth match of a given regular expression pattern in Python?

I needed to replace the Nth IP address in logs with a given IP address. This is how I achieved it in a Pythonic way; it may be useful to you as well.

import re

mystr = '203.23.48.0 DENIED 302 449 800 1.1 302 http d.flashresultats.fr 10.111.103.202 GET GET - 188.92.40.78 '
src = '1.1.1.1'
replace_nth = lambda mystr, pattern, sub, n: re.sub(re.findall(pattern, mystr)[n - 1], sub, mystr)
mystr = replace_nth(mystr, r'\S*\d+\.\d+\.\d+\.\d+\S*', src, 2)
print(mystr)
# Output: 203.23.48.0 DENIED 302 449 800 1.1 302 http d.flashresultats.fr 1.1.1.1 GET GET - 188.92.40.78
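Note that the findall-then-sub trick re-scans the string and replaces every occurrence of the Nth match's text, so it can misfire when that text appears more than once. A safer variant (my own sketch, not from the original post) counts matches inside a re.sub callback:

import re

def replace_nth(text, pattern, sub, n):
    # Replace only the n-th match of `pattern` in `text` with `sub`.
    counter = {"i": 0}
    def _repl(match):
        counter["i"] += 1
        return sub if counter["i"] == n else match.group(0)
    return re.sub(pattern, _repl, text)

mystr = '203.23.48.0 DENIED 302 449 800 1.1 302 http d.flashresultats.fr 10.111.103.202 GET GET - 188.92.40.78 '
print(replace_nth(mystr, r'\d+\.\d+\.\d+\.\d+', '1.1.1.1', 2))
# 203.23.48.0 DENIED 302 449 800 1.1 302 http d.flashresultats.fr 1.1.1.1 GET GET - 188.92.40.78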

Python : All About Enumeration Object

from enum import IntEnum, Enum, unique
import sys

sys.stdout = open(__name__, 'w')

class CountryCode1(Enum):
    Nepal = 977
    India = 91
    Pakistan = 92
    Bangladesh = 90
    Bhutan = 93

class CountryCode2(IntEnum):
    Nepal = 977
    India = 91
    Pakistan = 92
    Bangladesh = 90
    Bhutan = 93

def test_countryCode_1():
    for i in CountryCode1:
        print("With enum Inherited from Enum => Country : {}, Code : {}".format(i.name, i.value))

def test_countryCode_2():
    try:
        for i in sorted(CountryCode1):
            print("With enum Inherited from Enum & Trying to Sort Enum Object => Country : {}, Code : {}".format(i.name, i.value))
    except Exception as e:
        print("With enum Inherited from Enum & Trying to Sort Enum Object: {}".format(e))

de

Python: All about generators

Generators are very powerful in Python. They are created by using 'yield' inside a function. Here are the three most important variants, where 'yield' turns a function into a generator, a coroutine or a context manager: [Will explain each when I get time :)]

A. Using 'yield' as a generator in a function:

def mygen(a=0, b=1):
    while True:
        yield b
        a, b = b, a + b

c = mygen()
for i in range(10):
    print(c.next())

B. Using 'yield' as a coroutine:

def mycoroutine():
    v_count = 0
    inv_count = 0
    try:
        while True:
            myinput = yield
            if isinstance(myinput, int):
                v_count = v_count + 1
            else:
                inv_count = inv_count + 1
    except GeneratorExit:
        print("You Sent {} valid digits and {} invalid chars:".format(v_count, inv_count))

mygen = mycoroutine()
mygen.next()
for i in range(
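The third variant (yield as a context manager) is cut off above. As a minimal sketch of the idea (my own example, not the post's), contextlib.contextmanager turns a single-yield generator into a with-statement context manager:

from contextlib import contextmanager

@contextmanager
def opened(path):
    # Code before the yield runs on entering the with-block,
    # code after it runs on exit (even if an exception occurs).
    f = open(path)
    try:
        yield f
    finally:
        f.close()

with opened("/etc/hostname") as f:       # any readable file path works here
    print(f.read())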

Python: Method Resolution Order(MRO) in Inheritance

Let's consider the following Python code:

class A:
    def do(self):
        print("From the class: A")

class B(A):
    def do(self):
        print("From the class: B")

class C(A):
    def do(self):
        print("From the class: C")

class D(C):
    pass

class E(D, B):
    pass

class F(E):
    pass

What would be the output of the following?

inst = F()
inst.do()

To understand this, we should know the method resolution order of Python's inherited classes. In Python 2.x (old-style classes), the resolution algorithm is depth-first, then left to right. In the above example the order is F -> E -> D -> C -> A -> B -> A, so the lookup finds do() in C and the output is: From the class: C. In Python 3.x, the resolution follows C3 linearization, which honours left-to-right base order while never visiting a class before its subclasses. Here that yields F -> E -> D -> C -> B -> A (you can verify it with the snippet below), so the first class providing do() is again C and the output is: From the class: C. For details, refer to: https://en.wikipedia.org/wiki/C3_linearization
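A quick way to verify the order (a small illustrative snippet, assuming the classes above are already defined) is to ask Python for the MRO directly:

# Python 3: print the C3 linearization that attribute lookup follows
print([cls.__name__ for cls in F.__mro__])
# ['F', 'E', 'D', 'C', 'B', 'A', 'object']

# The same information via the inspect module
import inspect
print(inspect.getmro(F))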

Python : Memory Management and debugging

Memory Management in Python: important points.
- Dynamic allocation (on heap memory) -> objects and values; stack memory -> variables (references) and method/function call frames.
- Everything in Python is an object. Python always uses the heap to store ints, string values etc., unlike C.
- Python maintains reference counts when multiple variables point to the same object; for weak references the refcount is not incremented. Once the reference count (see sys.getrefcount) reaches zero, Python reclaims the object immediately.
- Refcount updates are not thread safe, which is the main reason the GIL comes into play.
- In a class we can use __slots__ = ('x', 'y') to prevent instances from growing new attributes.
- In complex data structures such as doubly linked lists, deques and trees, cyclical references can keep the refcount from ever reaching zero. In that case Python periodically runs a mark-and-sweep style cycle collector (as Java uses) by marking the only objec
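As a tiny illustration (my own sketch) of two of these points, reference counts and __slots__:

import sys

class Point(object):
    __slots__ = ('x', 'y')        # instances cannot grow new attributes

p = Point()
p.x, p.y = 1, 2
try:
    p.z = 3                       # not listed in __slots__
except AttributeError as e:
    print("blocked by __slots__:", e)

data = [1, 2, 3]
alias = data                      # a second strong reference to the same list
print(sys.getrefcount(data))      # count includes the temporary reference made by the call itself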

Linux OS : Boot Process (BIOS, MBR, GRUB, Kernel, Init, Runlevel)

What happens during the Linux boot process? Read here: https://www.thegeekstuff.com/2011/02/linux-boot-process What happens when a non-privileged user logs in after a reboot?
1. The init process (pid=1, uid=0) spawns the login process (pid=x, uid=0) via the fork() and exec() system calls.
2. The login process calls fork(), sets the uid for the user with setuid() (pid=x, uid=0), and then exec()s the user's shell.
3. The shell (uid > 0, pid=x) executes ~/.bashrc, exports all the environment variables, etc.
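A minimal illustrative sketch (my own, not the actual login(1) implementation) of the fork + exec pattern described above; the setuid step is left commented out because it requires root:

import os

pid = os.fork()                    # parent keeps running; the child sees pid == 0
if pid == 0:
    # A real login process would drop privileges first, e.g. os.setuid(1000)
    os.execvp("bash", ["bash", "-c", "id; echo HOME=$HOME"])   # replace the child with a shell
else:
    os.waitpid(pid, 0)             # init/login waits for the session to end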

Exception Handling :: Python

Python has powerful exception-handling capabilities. A few exceptions derive directly from BaseException, while most derive from Exception, which is itself a child of BaseException. All user-defined exceptions have to derive from Exception or one of its children. Let's see SystemExit, a direct child of BaseException, in action:

import sys
try:
    sys.exit()
except Exception as e:
    pass
print(e)

In the above case the program exits even though we try to catch Exception. That is because SystemExit derives from BaseException, not Exception. However, in the following case the exception is caught as expected:

import sys
try:
    sys.exit()
except BaseException as e:
    pass
print(e)   # e is the caught SystemExit instance

Let's see a few awful things that can go wrong with try/except:

try:
    raise "Value Error"
except "Value Error":
    pass
print("UnCaught!")
# TypeError: exce

Behind the scene: Memory Address allocation and Copy/DeepCopy in Python with List

Let's say we have a Python list a = [1, 1, 2]. Ever wondered how memory allocation happens for the list elements? In Python, variables are tags (references) to objects, meaning the two elements with value 1 are stored at the same location and 2 is stored at a different location. When we print the memory location of a, i.e. id(a), and the memory location of each element, we get:

(4315182216, 140602806130168, 1)
(4315182216, 140602806130168, 1)
(4315182216, 140602806130144, 2)

As seen above, the memory address of a is 4315182216, and the memory address of 1 is 140602806130168 for both the first and second elements. To understand the impact of this kind of allocation, let's look at the various ways of copying a list and check how modifying the original list variable or its elements affects the copy.

::Without using Copy Module::: i.e. when b = a
----------------------------------------Memory addresses of List a, before modification on a:------
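The excerpt cuts off before the comparison. As a minimal sketch of the idea (my own example, not the post's exact code), here is how plain assignment, copy.copy and copy.deepcopy differ for a nested list:

import copy

a = [[1, 2], [3, 4]]
b = a                    # same object, no copy at all
c = copy.copy(a)         # shallow copy: new outer list, shared inner lists
d = copy.deepcopy(a)     # deep copy: inner lists duplicated too

a[0][0] = 99             # mutate an inner element of the original
a.append([5, 6])         # mutate the outer list

print(b)   # [[99, 2], [3, 4], [5, 6]]  -> sees both changes
print(c)   # [[99, 2], [3, 4]]          -> sees the inner change only
print(d)   # [[1, 2], [3, 4]]           -> unaffected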

Serialisation and DeSerialisation of python Objects

Serialisation and deserialisation, a.k.a. SerDe, are important aspects of streaming data from objects into a file and from the file back into objects. There are many ways to do this in Python; a few of the tools are PyYAML, JSON and cPickle. cPickle is the fastest, as it can SerDe almost any Python object (strings/text, lists, dicts, class objects) in byte format. One thing to worry about with SerDe tools such as cPickle is that security is a great concern if we store object information as a byte stream in a file. Hence, cPickle should be used carefully, with the highest possible protocol. The following example demonstrates SerDe with cPickle for Python class objects.

import cPickle

class acc:
    def __init__(self, id, bal):
        self.id = id
        self.bal = bal
    def dep(self, amount):
        self.bal += amount
    def withdraw(self, amount):
        self.bal -= amount

ac = acc('1988999', 200)
ac.dep(1000)
ac.withdraw(500)
fd = open("ty2"
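For the "highest possible protocol" point, a small sketch of my own (Python 2 style to match the post; in Python 3 the module is simply pickle):

import cPickle

account = {'id': '1988999', 'bal': 700}

# Dump with the highest protocol available (binary, faster and more compact)
with open('account.pkl', 'wb') as fd:
    cPickle.dump(account, fd, cPickle.HIGHEST_PROTOCOL)

# Only unpickle data you trust -- loading can execute arbitrary reconstruction logic
with open('account.pkl', 'rb') as fd:
    print(cPickle.load(fd))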

How can you invoke all the methods of a class at one go without explicitly calling each?

# Let's define a class with some methods:

class Nums:
    def __init__(self, n):
        self.val = n
    def multwo(self):
        self.val *= 2
    def addtwo(self):
        self.val += 2
    def devidetwo(self):
        self.val /= 2

# Let's create an instance of the class:
f = Nums(2)

Now, call each and every method listed by dir(f). You should, however, take care of methods taking arguments; for example, __init__ needs args, so we skip it.

for m in dir(f):
    mymethod = getattr(f, m)
    if callable(mymethod):
        if 'init' not in str(mymethod):
            mymethod()
    else:
        pass
print(f.val)

=> It results in 4. Wonder why? To figure it out, let's see the order in which the methods within the class are invoked. This is how the instance stores its methods and properties:

['__doc__', '__init__', '__module__', 'addtwo', 'devidetwo', 'multwo', 'val']

So, val = (2+2)/

Preventing and enforcing the override in derived classes in Python

In Python there is nothing like true data hiding. However, we can represent it by convention:
_ (single underscore) marks a method or attribute as internal by convention.
__ (double underscore) triggers name mangling, so the method or attribute effectively can't be overridden from derived classes; redefining it in a subclass creates a differently mangled name, and trying to access the base's version directly raises an exception.
@abstractmethod decorator on a method means it must be overridden in the derived class; not implementing that method causes an exception when the class is instantiated.
__xyz__ (double underscore prefix and suffix) means the method/attribute is a built-in ("dunder") hook and can be overridden.

Example:

from abc import ABC, abstractmethod

class Base(ABC):
    def __init__(self):
        self.result = self._private()
    def _private(self):
        return "Yes"
    @abstractmethod
    def you_should_override(self):
        pass
    def __you_cant_override(self):
        pass

class Derived(Base):
    def you_should_override(self):
        pass

Data Models in Python Using LinkedList example

Python data models are an important aspect of designing objects: they let you implement how your objects should behave for the user. For example, by default print(object) prints nothing useful, but we can make it produce something meaningful. This experiment implements a few data-model (dunder) methods for a LinkedList object in Python, such as __iter__ (makes the LinkedList object iterable), __eq__ (compares two LinkedLists with the == operator), __add__ (adds the elements of two LinkedLists), __str__ (prints the elements of the LinkedList when the object is printed) and __repr__ (the representation of the LinkedList object returned by repr()). For implementation details, please visit the Git link: https://github.com/Indu-sharma/Utilities/tree/master/Python_generic/PythonDataModels
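As a minimal illustration of the idea (my own simplified sketch, not the code in the repository):

class LinkedList(object):
    class Node(object):
        def __init__(self, value, nxt=None):
            self.value, self.nxt = value, nxt

    def __init__(self, *values):
        self.head = None
        for v in reversed(values):
            self.head = self.Node(v, self.head)

    def __iter__(self):                      # makes the object iterable
        node = self.head
        while node:
            yield node.value
            node = node.nxt

    def __eq__(self, other):                 # enables == between two lists
        return list(self) == list(other)

    def __str__(self):                       # what print(obj) shows
        return " -> ".join(str(v) for v in self)

print(LinkedList(1, 2, 3))                   # 1 -> 2 -> 3
print(LinkedList(1, 2) == LinkedList(1, 2))  # True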

Malicious & Good URLs List

Malicious URLs:
textspeier.de/ 104.27.163.228
photoscape.ch/Setup.exe 31.148.219.11
sarahdaniella.com/swift/SWIFT%20$.pdf.ace 63.247.140.224
amazon-sicherheit.kunden-ueberpruefung.xyz 185.61.138.74
alegroup.info/ntnrrhst 194.87.217.87
fourthgate.org/Yryzvt 104.200.67.194
dieutribenhkhop.com/parking/ 84.200.4.125
dieutribenhkhop.com/parking/pay/rd.php?id=10 84.200.4.125
ssl-6582datamanager.de/ 54.72.9.51
privatkunden.datapipe9271.com/ 104.31.75.147
www.hjaoopoa.top/admin.php?f=1.gif 52.207.234.89
up.mykings.pw:8888/update.txt 60.250.76.52
down.mykings.pw:8888/ver.txt 60.250.76.52
down.mykings.pw:8888/ups.rar 60.250.76.52
fo5.a1-downloader.org/g2v9s1.php?id=yourname@yourdomain.com 188.225.32.177
falconsafe.com.sg/api/get.php?id=aW5mb0BzYXBjdXBncmFkZXMuY29t 43.229.84.107
www.lifelabs.vn/api/get.php?id=aW5mb0BzYXBjdXBncmFkZXMuY29t 118.69.196.199
61kx.uk-insolvencydirect.com/sending_data/in_cgi/bbwp/cases/Inquiry.php 35.166.113.223
daralasnan.com/wp-content/pl

DNS Malicious traffic simulation on Kali Linux

1> Spoof DNS Traffic : https://packetstormsecurity.com/files/10080/ADMid-pkg.tgz.html https://null-byte.wonderhowto.com/how-to/hack-like-pro-spoof-dns-lan-redirect-traffic-your-fake-website-0151620/

Using PySpark to demonstrate Spark transformations and actions on RDDs and stages/DAG evaluation - Part 3 (How to avoid shuffle joins in PySpark)

In this part, let's demonstrate Spark transformations and actions on multiple RDDs using PySpark. Consider two datasets:

dataset1=["10.1.0.1,1000","10.1.0.2,2000","10.1.0.3,3000"]*1000
dataset2=["10.1.0.1,hostname1","10.1.0.2,hostname2","10.1.0.3,hostname3"]

I want the final output in the following form, aggregating the bytes in the first dataset in case of duplicate records:

10.1.0.1,hostname1,100000

Approach-1: First aggregate dataset1 by IP address and save it as an RDD, then create an RDD from the second dataset and join the two RDDs.

dataset1_rdd = sc.parallelize(dataset1).map(lambda x: x.split(",")).mapValues(lambda x: int(x)).reduceByKey(lambda x, y: x + y)
dataset2_rdd = sc.parallelize(dataset2).map(lambda x: tuple(x.split(",")))
for i in dataset1_rdd.join(dataset2_rdd).map(lambda x: str(x[0]) + "," + str(x[1][1]) + "," + str(x[1][0])).collect():
    print i

10.1.0.1,hostname1,1000
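The excerpt is cut off before the shuffle-free approach promised in the title. As a minimal sketch of the usual alternative (my own illustration, not necessarily the post's exact code), broadcast the small dataset and do a map-side join so the join itself needs no shuffle:

# Approach-2 (sketch): broadcast the small lookup dataset, join inside map()
small = dict(tuple(x.split(",")) for x in dataset2)        # {ip: hostname}
lookup = sc.broadcast(small)

result = (sc.parallelize(dataset1)
            .map(lambda x: x.split(","))
            .map(lambda kv: (kv[0], int(kv[1])))
            .reduceByKey(lambda x, y: x + y)               # the aggregation still shuffles
            .map(lambda kv: "%s,%s,%d" % (kv[0], lookup.value.get(kv[0], "unknown"), kv[1])))

for line in result.collect():
    print line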

Using PySpark to demonstrate Spark transformations and actions on RDDs and stages/DAG evaluation - Part 2

In this part, we use unstructured text to demonstrate Spark transformations and actions with PySpark.

file:///data/text.csv
myrdd = sc.textFile("file:///data/text.csv")
myrdd.take(2)
[u'Think of it for a moment \u2013 1 Qunitillion = 1 Million Billion! Can you imagine how many drives / CDs / Blue-ray DVDs would be required to store them? It is difficult to imagine this scale of data generation even as a data science professional. While this pace of data generation is very exciting,  it has created entirely new set of challenges and has forced us to find new ways to handle Big Huge data effectively.', u'']

Q1: Convert all words in the RDD to lowercase and split the lines of the document on spaces.

myrdd.map(lambda x: x.lower().split()).take(2)
[[u'think', u'of', u'it', u'for', u'a', u'moment', u'\u2013', u'1', u'qunitillion', u'=', u'1', u'milli

Using PySpark to demonstrate Spark transformations and actions on RDDs and stages/DAG evaluation - Part 1

Throughout the experiment we will use the following connections information:

/usr/bin/pyspark
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.INFO)

file: connections.csv
srcIp,dstIp,srcPort,dstPort,protocol,bytes
1.1.1.1,10.10.10.10,11111,22,tcp,1000
1.1.1.1,10.10.10.10,22222,69,udp,2000
2.2.2.2,20.20.20.20,33333,21,tcp,3000
2.2.2.2,30.30.30.30,44444,69,udp,4000
3.3.3.3,30.30.30.30,44444,22,tcp,5000
4.4.4.4,40.40.40.40,55555,25,tcp,6000
5.5.5.5,50.50.50.50,66666,161,udp,7000
6.6.6.6,60.60.60.60,77777,162,tcp,8000

Q1. Find the sum of bytes sent by each srcIp.

headers = sc.textFile("file:///data/connections.csv").first()
filtered_rdd = sc.textFile("file:///data/connections.csv").filter(lambda x: x != headers and x.strip())
# Transformations -> filter; number of stages -> 1
for i in filtered_rdd.map(lambda x: (x.split(",")[0], int(x.split(",")[-1]))).reduceByKey(lambda x
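The excerpt cuts off mid-expression; assuming the obvious intent (sum of bytes per srcIp), a complete version of Q1 would look roughly like this (my own sketch):

headers = sc.textFile("file:///data/connections.csv").first()
filtered_rdd = sc.textFile("file:///data/connections.csv").filter(lambda x: x != headers and x.strip())

# (srcIp, bytes) pairs, summed per srcIp; reduceByKey adds one shuffle stage
bytes_per_src = (filtered_rdd
                   .map(lambda x: (x.split(",")[0], int(x.split(",")[-1])))
                   .reduceByKey(lambda x, y: x + y))

for ip, total in bytes_per_src.collect():
    print ip, total        # e.g. 1.1.1.1 3000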