Basic Workflows/Algorithms for Machine Learning in Text Processing
Web Extraction: Apache Tika (or any custom crawler)

Concept Extraction
    Sentence Detection (maxent-3.0.0.jar, opennlp-tools-1.5.3.jar)
    NER (Named Entity Recognition)
    Concept Extraction (snowball-stemmer-1.3.0.581.1.jar)
    Multi phrase to list phrase

Concept Filtering
    Zipf Filtering
    ChiSq Filtering
    Low Frequency Filtering
    Signal Filtering

String Indexer: StringIndexer (spark-mllib_2.10-2.1.0.jar)

Hashing TF: HashingTF (spark-mllib_2.10-2.1.0.jar)

IDF (spark-mllib_2.10-2.1.0.jar)

Classifier for Supervised Classification Algorithms: [Train & Predict]
    . Naive Bayes
    . Support Vector Machines (C Value, Gamma Value & Kernel)
    . Decision Trees (Entropy, Min-Samples-Split, Impurity, Information Gain)
    . K Nearest Neighbors
    . Random Forest
    . Linear Regression / Logistic Regression
    . AdaBoost (sometimes also called Boosted Decision Tree)

UnSupervised Classification algorith
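
The filtering, Hashing TF, IDF and Naive Bayes [Train & Predict] steps above can be sketched end to end in a few lines. What follows is a minimal, dependency-free illustration, not the Spark MLlib or OpenNLP code itself. The function names, the bucket count NUM_FEATURES, the whitespace tokenizer and the use of crc32 as the hash are all assumptions made for the example. The IDF uses the smoothed form log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents that hit a bucket.

```python
import math
import zlib
from collections import Counter, defaultdict

NUM_FEATURES = 64  # hypothetical bucket count, kept tiny for readability


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real concept extraction)."""
    return text.lower().split()


def drop_low_frequency(docs, min_count=2):
    """Low-frequency filtering: keep tokens seen at least min_count times overall."""
    totals = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if totals[tok] >= min_count] for doc in docs]


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Hash each token to a bucket and count occurrences per bucket."""
    vec = [0.0] * num_features
    for tok in tokens:
        # crc32 is an assumption chosen for determinism; it is not Spark's hash.
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1.0
    return vec


def fit_idf(tf_vectors):
    """Per-bucket IDF using the smoothed form log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    n = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(n)]
    return [math.log((m + 1) / (d + 1)) for d in df]


def tf_idf(tf_vec, idf):
    """Scale each term-frequency bucket by its IDF weight."""
    return [t * w for t, w in zip(tf_vec, idf)]


def train_naive_bayes(vectors, labels, alpha=1.0):
    """Train: multinomial Naive Bayes with additive (Laplace) smoothing."""
    n = len(vectors[0])
    class_counts = Counter(labels)
    feature_sums = defaultdict(lambda: [0.0] * n)
    for vec, label in zip(vectors, labels):
        sums = feature_sums[label]
        for i, x in enumerate(vec):
            sums[i] += x
    model = {}
    for label, count in class_counts.items():
        sums = feature_sums[label]
        total = sum(sums) + alpha * n
        log_prior = math.log(count / len(labels))
        log_likelihood = [math.log((s + alpha) / total) for s in sums]
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, vec):
    """Predict: return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(x * w for x, w in zip(vec, log_likelihood))
    return max(model, key=score)


if __name__ == "__main__":
    # Toy corpus and labels, invented purely for illustration.
    docs = ["spark mllib hashing tf idf", "apache tika web crawler",
            "spark naive bayes classifier", "tika web extraction"]
    labels = ["ml", "crawl", "ml", "crawl"]
    tokens = drop_low_frequency([tokenize(d) for d in docs], min_count=2)
    tfs = [hashing_tf(t) for t in tokens]
    idf = fit_idf(tfs)
    model = train_naive_bayes([tf_idf(v, idf) for v in tfs], labels)
    print(predict(model, tf_idf(hashing_tf(tokenize("spark job")), idf)))
```

Hashing avoids keeping a vocabulary in memory, at the cost of occasional bucket collisions. A real pipeline would use far more buckets than the 64 assumed here.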