Text to Features using Python

Himang Sharatun
4 min read · Sep 19, 2018

Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering. — Andrew Ng

When I first started learning about machine learning and following Andrew Ng's course, I didn't really understand, and even disagreed with, what he means in the quote above, because I thought that machine learning was mainly about choosing and tuning the best algorithm for your specific case. My naive mind at that moment believed that, as long as you choose the correct algorithm and put some effort into tuning the hyperparameters, you will be able to create a good machine learning model. Yes, I used to be that annoying guy who learns machine learning just by reading papers and articles without any practical experience, but still believes his knowledge is good enough. But the deeper I dove into the practical use of machine learning, the more I found out that the algorithm is indeed just a small part of it. In most cases, what confuses me the most is not which algorithm to use but which features suit my data the most. Therefore, in this article, I would like to explain various feature extraction techniques for text analysis.

Bag of Words

If you remember my previous articles, this technique is the one that I used to classify topics and intent from input text. BOW works by creating an array over the dictionary that has the value 1 if the word is in the input text and 0 otherwise. Here is an illustration of how BOW works.
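For a concrete picture, here is a tiny hand-made example (the dictionary and sentence are my own, assumed purely for illustration):

```python
# Toy dictionary built from the training data (assumed for illustration)
dictionary = ["airplane", "airport", "flies", "lands", "pilot", "the"]

sentence = "the pilot flies the airplane"
tokens = sentence.split()

# 1 if the dictionary word appears in the sentence, 0 otherwise
bow_vector = [1 if word in tokens else 0 for word in dictionary]
print(bow_vector)  # [1, 0, 1, 0, 1, 1]
```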

To implement this BOW technique in Python, you can use the scikit-learn class CountVectorizer as follows:
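A minimal sketch using CountVectorizer (the example corpus is my own, assumed for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Example corpus (assumed for illustration)
corpus = [
    "the pilot flies the airplane",
    "the airplane lands at the airport",
]

# binary=True gives the 1/0 presence vectors described above;
# the default would count occurrences instead.
vectorizer = CountVectorizer(binary=True)
bow_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.vocabulary_)   # the learned dictionary (word -> column index)
print(bow_matrix.toarray())     # one row of 1s and 0s per sentence
```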

Term Frequency–Inverse Document Frequency (TFIDF)

In my opinion, TF-IDF is just an improvement of the BOW technique. Both rely on a dictionary that you build from the training data, but the difference is that TF-IDF also represents the importance of a word in that specific sentence. Instead of using 1 to indicate that a certain word exists in the sentence, TF-IDF uses a value between 0 and 1: roughly, how often the word appears in the sentence (term frequency) scaled down by how common the word is across all sentences (inverse document frequency), so the value also represents how important that word is to the sentence.

To implement TF-IDF in Python, you can also use scikit-learn, but instead of CountVectorizer, for TF-IDF we will use TfidfVectorizer as follows:
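A minimal sketch with TfidfVectorizer, reusing the same assumed example corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same assumed example corpus as before
corpus = [
    "the pilot flies the airplane",
    "the airplane lands at the airport",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Each row is a sentence; each column is a dictionary word,
# now weighted by how important that word is to the sentence.
print(vectorizer.vocabulary_)
print(tfidf_matrix.toarray())
```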

Word2Vec

If you use BOW or TF-IDF, sooner or later you will run into a big problem, which is matrix size. Since the BOW and TF-IDF matrix size is determined by the number of words in the dictionary, the bigger your dictionary, the bigger the matrix for both techniques. In case you have never touched machine learning before, a bigger matrix means more time and resources needed to process it. But don't you worry, if your dictionary is that big you can use word2vec to reduce the matrix dimension.

Another advantage of using the word2vec representation is that it also captures relationships between words. So, for example, if you have the word 'airplane', word2vec will be able to tell you that similar words are 'airport', 'pilot' and 'turbine'. Word2vec works by using a neural network to learn the features of each word. So instead of building a dictionary like BOW and TF-IDF, to create a word2vec model you need to train a neural network. But don't worry, our good old friend Python also has a library for word2vec called gensim. Here is the implementation to train word2vec using Python:
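A minimal sketch of training a word2vec model with gensim (the toy tokenized corpus and parameter values are my assumptions; real training needs a much larger corpus):

```python
from gensim.models import Word2Vec

# Tokenized training sentences (toy corpus, assumed for illustration)
sentences = [
    ["the", "pilot", "flies", "the", "airplane"],
    ["the", "airplane", "lands", "at", "the", "airport"],
    ["the", "pilot", "walks", "through", "the", "airport"],
]

# vector_size is the dimension of each word vector (called `size` in gensim < 4.0);
# min_count=1 keeps every word because the toy corpus is tiny.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

print(model.wv["airplane"])                # 100-dimensional vector for one word
print(model.wv.most_similar("airplane"))   # words with the closest vectors
```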

If you pay attention to how word2vec works, you will realize that it is not a sentence-level embedding like BOW and TF-IDF but just a word-level embedding. That is not a problem if you plan to feed it to an RNN or LSTM, but if you just want to create a simple neural network you might need an additional technique to convert word-level word2vec into sentence-level word2vec. Here are 3 techniques to convert word-level word2vec into sentence-level word2vec that I have personally used in my projects:

Average Word2Vec

The simplest way to convert word-level word2vec into sentence-level word2vec is to take the average of all the word vectors in the sentence. In word2vec each word is already converted into a vector that represents its context, and a sentence's meaning is just the collaborative meaning of the words it consists of. Therefore, to represent a sentence in word2vec without losing its contextual meaning, we just need to take the average of each word's vector. Here is the implementation of sentence-level word2vec using averaged word vectors in Python:
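A minimal sketch of the averaging, assuming the gensim model trained above:

```python
import numpy as np

def sentence_vector(tokens, model):
    """Average the word2vec vectors of the words in a tokenized sentence.

    Words missing from the model's vocabulary are skipped; if none are
    left, a zero vector is returned.
    """
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# Reusing the gensim model trained above (assumed)
print(sentence_vector(["the", "pilot", "flies", "the", "airplane"], model))
```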

Average Word2Vec + TFIDF

If previously we only used the average of the word vectors for the sentence embedding, this time we will add TF-IDF to the calculation. Why? As I said before, a sentence's meaning is just the collaborative meaning of the words it consists of, but each word carries a different degree of importance in a sentence. I mean, the word "are" contains less meaning than "pilot" in a sentence. Just from the word "pilot", we could guess that the sentence is most likely talking about airlines or airplanes, but from the word "are" we can't guess what the sentence is about. Basically, it would not be fair to weight the words "are" and "pilot" equally when calculating the sentence's meaning. Therefore we add TF-IDF to weight the importance of each word vector to the sentence. Here is the implementation of the technique in Python:
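A minimal sketch of the TF-IDF weighted average, assuming the same gensim model and toy corpus as above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same assumed example corpus as before
corpus = [
    "the pilot flies the airplane",
    "the airplane lands at the airport",
]

# Learn an IDF weight for every dictionary word
tfidf = TfidfVectorizer()
tfidf.fit(corpus)
idf_weights = {word: tfidf.idf_[index] for word, index in tfidf.vocabulary_.items()}

def weighted_sentence_vector(tokens, model, idf_weights):
    """TF-IDF weighted average of the word2vec vectors in a sentence.

    A word that appears several times is counted each time, so its
    effective weight is term frequency times IDF.
    """
    vectors, weights = [], []
    for word in tokens:
        if word in model.wv and word in idf_weights:
            vectors.append(model.wv[word])
            weights.append(idf_weights[word])
    if not vectors:
        return np.zeros(model.vector_size)
    return np.average(vectors, axis=0, weights=weights)

# Reusing the gensim model trained above (assumed)
print(weighted_sentence_vector(["the", "pilot", "flies", "the", "airplane"], model, idf_weights))
```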
