Basic Workflows/Algorithms for Machine Learning in Text Processing


Web Extraction:
  •  Apache Tika (or any custom crawler)
Concept Extraction

  •     Sentence Detection (maxent-3.0.0.jar, opennlp-tools-1.5.3.jar)
  •     Named Entity Recognition (NER)
  •     Concept Extraction (snowball-stemmer-1.3.0.581.1.jar)
  •     Multi phrase to list phrase
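
The first steps of concept extraction can be sketched as below. This is a minimal illustration that uses a regular expression as a naive stand-in for a trained sentence detector such as the OpenNLP one listed above; it does not perform NER or stemming, and the helper names (split_sentences, tokenize) are hypothetical.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and an
# uppercase letter. A trained maxent detector handles far more cases.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
_TOKEN = re.compile(r"[A-Za-z]+")


def split_sentences(text):
    """Split raw text into sentences using the naive boundary rule."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def tokenize(sentence):
    """Return the lowercase alphabetic tokens of one sentence."""
    return [t.lower() for t in _TOKEN.findall(sentence)]


text = "Spark indexes terms. OpenNLP detects sentences! Does it scale?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

The token lists produced here are the kind of input the filtering and indexing stages further down would consume.
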
Concept Filtering

  •   Zipf Filtering
  •   Chi-Square (ChiSq) Filtering
  •   Low-Frequency Filtering
  •   Signal Filtering
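
Two of the filters above can be sketched with plain term counts. The sketch assumes that "Zipf filtering" means trimming the most frequent terms at the head of the rank-frequency distribution and that "low frequency filtering" means dropping terms below a minimum count; the function name and the parameters min_count and drop_top are hypothetical.

```python
from collections import Counter


def filter_concepts(docs, min_count=2, drop_top=1):
    """Keep terms that are neither too rare nor among the most frequent."""
    counts = Counter(term for doc in docs for term in doc)
    # Rank by descending frequency; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    too_common = set(ranked[:drop_top])
    return {t for t in counts if counts[t] >= min_count and t not in too_common}


docs = [
    ["the", "spark", "cluster"],
    ["the", "spark", "job"],
    ["the", "tika", "parser"],
]
kept = filter_concepts(docs)
```

With these toy documents, "the" is removed as the top-ranked term and every term seen only once is removed as low frequency, which leaves "spark".
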
String Indexer:

  •    StringIndexer (spark-mllib_2.10-2.1.0.jar)
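
What a string indexer does can be shown without Spark. The sketch below mirrors the default ordering of Spark ML's StringIndexer, where the most frequent label receives index 0; breaking ties alphabetically is an assumption made here for determinism, and fit_string_indexer is a hypothetical helper, not a Spark API.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


labels = ["sports", "politics", "sports", "tech", "sports", "politics"]
index_of = fit_string_indexer(labels)
encoded = [index_of[label] for label in labels]
```

The resulting numeric column is what a classifier would take as its label input.
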
Hashing TF

  •    HashingTF (spark-mllib_2.10-2.1.0.jar)
  •    IDF (spark-mllib_2.10-2.1.0.jar)
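
The hashing trick followed by IDF weighting can be sketched in plain Python. This is not Spark code: zlib.crc32 stands in for the hash function Spark uses, the feature count of 16 is arbitrary, and the helper names are hypothetical. The IDF formula log((N + 1) / (df + 1)) is the one documented for Spark MLlib.

```python
import math
import zlib


def hashing_tf(tokens, num_features=16):
    """Count tokens into a fixed-size vector by hashing each token to a slot."""
    vec = [0.0] * num_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


def fit_idf(tf_vectors):
    """Compute one IDF weight per slot: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    n_features = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(n_features)]
    return [math.log((n_docs + 1) / (d + 1)) for d in df]


def tf_idf(tf_vec, idf):
    """Scale term frequencies by the fitted IDF weights."""
    return [tf * w for tf, w in zip(tf_vec, idf)]


docs = [["spark", "hashing", "tf"], ["spark", "idf"], ["spark", "tika", "spark"]]
tfs = [hashing_tf(d) for d in docs]
idf = fit_idf(tfs)
weighted = [tf_idf(v, idf) for v in tfs]
```

Because "spark" occurs in every document, its slot gets an IDF weight of zero, so it contributes nothing to the weighted vectors. Distinct tokens can collide in the same slot, which is the usual trade-off of hashing over an explicit vocabulary.
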

Supervised Classification Algorithms:  [Train & Predict]

  • Naive Bayes
  • Support Vector Machines (C Value, Gamma Value & Kernel)
  • Decision Trees (Entropy, Min-Samples-Split, Impurity, Information Gain)
  • K Nearest Neighbors
  • Random Forest
  • Linear Regression/Logistic Regression
  • AdaBoost (sometimes also called Boosted Decision Trees)
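
The train and predict steps can be illustrated with the first classifier in the list. The sketch below is a small multinomial Naive Bayes with Laplace smoothing written in plain Python; the toy documents, the smoothing value alpha and the function names are assumptions for illustration, not part of any library.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Collect per-class document counts and per-class term counts."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "term_counts": term_counts,
        "vocab": vocab,
        "alpha": alpha,
        "n_docs": len(labels),
    }


def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    best_label, best_score = None, -math.inf
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    for label, n in model["class_counts"].items():
        counts = model["term_counts"][label]
        total = sum(counts.values())
        score = math.log(n / model["n_docs"])  # log prior
        for tok in tokens:
            if tok in model["vocab"]:  # ignore terms never seen in training
                score += math.log((counts[tok] + alpha) / (total + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train_docs = [
    ["goal", "match", "team"],
    ["team", "win", "goal"],
    ["election", "vote", "party"],
    ["party", "vote", "win"],
]
train_labels = ["sports", "sports", "politics", "politics"]
model = train_naive_bayes(train_docs, train_labels)
prediction = predict_naive_bayes(model, ["goal", "team"])
```

The other classifiers in the list follow the same two-phase shape: fit a model on labelled feature vectors, then score unseen documents against it.
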

Unsupervised Classification Algorithms:
  • Clustering/SubClustering (Scoring)
    • k-means
    • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (sub-clustering algorithm)
    • Mixture models
    • Hierarchical clustering
  • Anomaly Detection (Outlier Detection)
  • Neural Networks
    • Hebbian Learning
    • Generative Adversarial Networks
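
As an illustration of the clustering entries above, k-means can be sketched as an alternation of an assignment step and a centroid update step. Seeding the centroids from the first k points is an assumption made here so that the result is deterministic; that seeding can settle in a poor local optimum, which is why practical implementations usually use random or k-means++ seeding. The function names and the toy points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, assignments)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first-k seeding
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.8), (0.8, 1.2), (7.8, 8.2)]
centroids, assignments = kmeans(points, k=2)
```

On these toy points the two groups separate cleanly, with one centroid near (1, 1) and the other near (8, 8).
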
