Basic Workflows/Algorithms for Machine Learning in Text Processing
Web Extraction:
- Apache Tika (or Any Custom Crawler)
Concept Extraction
- Sentence Detection (maxent-3.0.0.jar, opennlp-tools-1.5.3.jar)
- NER (Named Entity Recognition)
- Concept Extraction (snowball-stemmer-1.3.0.581.1.jar)
- Multi phrase to list phrase
Concept Filtering
- Zipf Filtering
- ChiSq Filtering
- Low frequency Filtering
- Signal Filtering
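
A minimal sketch of two of the filters above, in plain Python. It assumes "Zipf filtering" means dropping the very frequent head of the term distribution and "low frequency filtering" means dropping terms seen fewer than a minimum number of times; the outline does not define either, so the function name `filter_concepts` and both thresholds are illustrative only.

```python
from collections import Counter

def filter_concepts(docs, min_count=2, top_fraction=0.1):
    """Return the set of terms kept after head and tail filtering.

    docs: list of token lists.
    min_count: hypothetical low-frequency cutoff (terms below it are dropped).
    top_fraction: hypothetical share of the most frequent terms to drop.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    # Terms ranked from most to least frequent (the Zipf ranking).
    ranked = [term for term, _ in counts.most_common()]
    head = set(ranked[:int(len(ranked) * top_fraction)])
    return {t for t in ranked if t not in head and counts[t] >= min_count}
```

Both thresholds would normally be tuned per corpus; ChiSq and signal filtering need class labels and are not covered by this sketch.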
String Indexer:
- StringIndexer (spark-mllib_2.10-2.1.0.jar)
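
To illustrate what a string indexer does, here is a plain-Python sketch (not the Spark API). It mimics the usual behaviour of assigning index 0 to the most frequent label; the alphabetical tie-break and the function names are assumptions made here so that the result is deterministic.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: idx for idx, lab in enumerate(ordered)}

def transform_labels(labels, index):
    """Replace each label by its index from a fitted mapping."""
    return [index[lab] for lab in labels]
```

For example, fitting on `["sports", "politics", "sports"]` maps "sports" to 0 and "politics" to 1.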
Hashing TF
- HashingTF (spark-mllib_2.10-2.1.0.jar)
- IDF (spark-mllib_2.10-2.1.0.jar)
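
The hashing trick followed by IDF weighting can be sketched in plain Python as below. This is not the Spark implementation: `zlib.crc32` stands in for the hash function Spark uses, the bucket count of 16 is arbitrary, and the helper names are hypothetical. The smoothed formula log((m + 1) / (df + 1)) is the one documented for Spark MLlib's IDF.

```python
import math
import zlib

def hashing_tf(tokens, num_features=16):
    """Term-frequency vector built by hashing each token into a bucket."""
    vec = [0.0] * num_features
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        bucket = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[bucket] += 1.0
    return vec

def fit_idf(tf_vectors):
    """Per-bucket IDF weights: log((m + 1) / (df + 1)) over m documents."""
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(num_features)]
    return [math.log((m + 1) / (d + 1)) for d in df]

def tf_idf(tf_vec, idf):
    """Scale a term-frequency vector by the fitted IDF weights."""
    return [t * w for t, w in zip(tf_vec, idf)]
```

A small bucket count makes collisions likely, so real pipelines usually pick a much larger `num_features`.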
Supervised Classification Algorithms: [Train & Predict]
- Naive Bayes
- Support Vector Machines (C Value, Gamma Value & Kernel)
- Decision Trees (Entropy, Min-Samples-Split, Impurity, Information Gain)
- K Nearest Neighbors
- Random Forest
- Linear Regression/Logistic Regression
- AdaBoost (Sometimes Also Called Boosted Decision Trees)
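
As one concrete instance of the train and predict steps, here is a small multinomial Naive Bayes sketch in plain Python with Laplace smoothing. The function names, the `alpha` default, and the dictionary-based model layout are choices made for this illustration, not part of any library.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit log-priors and smoothed log-likelihoods from tokenized docs."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    model = {"priors": {}, "likelihoods": {}}
    for label, c in class_counts.items():
        model["priors"][label] = math.log(c / len(labels))
        # Laplace smoothing: add alpha to every vocabulary term's count.
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        model["likelihoods"][label] = {
            t: math.log((term_counts[label][t] + alpha) / total) for t in vocab
        }
    return model

def predict_nb(model, tokens):
    """Return the label with the highest posterior log-score."""
    scores = {}
    for label, prior in model["priors"].items():
        ll = model["likelihoods"][label]
        # Tokens never seen in training are ignored.
        scores[label] = prior + sum(ll[t] for t in tokens if t in ll)
    return max(scores, key=scores.get)
```

The other classifiers in the list follow the same fit-then-predict shape; only the model inside changes.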
Unsupervised Classification Algorithms
- Clustering/SubClustering (Scoring)
- k-means
- Density-Based Spatial Clustering (DBSCAN) (SubClustering Algorithm)
- Mixture models
- Hierarchical clustering
- Anomaly Detection (Outlier Detection)
- Neural Networks
- Hebbian Learning
- Generative Adversarial Networks
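
For the clustering entries, a bare-bones k-means loop in plain Python may help show the assign-then-update idea. Seeding the centroids with the first k points is an assumption made to keep the sketch deterministic; practical implementations use random or k-means++ initialization.

```python
def kmeans(points, k, iterations=20):
    """Cluster numeric tuples; returns (centroids, assignments)."""
    # Deterministic seeding with the first k points (assumption for this sketch).
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments
```

In a text pipeline the points would be the TF-IDF vectors produced earlier rather than 2-D coordinates.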