Text mining is basically defined as the conversion of large volumes of data or
documents into useful numbers. It is used to extract useful or meaningful
information from raw data with the help of various algorithms and patterns in
the data. Text mining is applied to unstructured or semi-structured data such
as emails and text messages. For example, it is used to filter out spam
messages in email by identifying certain text that is common in such emails.
Once information has been retrieved from the data or documents, it is used in
data mining projects (clustering and factoring, graphics, predictive data
mining).
Text mining is the same as data mining except that text mining works on raw or
unstructured text such as emails, HTML, or text documents, while data mining
works on structured data.
Common aspects of text mining include removing certain keywords like “THE”,
punctuation marks, etc. from the important data to improve search quality. This
is covered under preprocessing text.
Text mining is used for various educational, research, and industrial purposes
such as social media, research papers, and sentiment analysis.
FOR PREPROCESSING TEXT
To Reduce the Size of Text Document
i) To eliminate words according to their
It is used to eliminate common words, or stop words, such as “the” and “and”.
To Improve Efficiency and Performance of Information Retrieval System in Text
It can save the administrator a significant amount of time and space resources.
WAYS OF PREPROCESSING TEXT
Tokenization is the process of breaking textual content into meaningful words,
terms, or symbols, which are known as tokens. These words are separated using
full stops, commas, and whitespace. Tokenization depends on the language used:
for English it is a simple task, while for languages like Chinese and Korean it
is difficult to perform.
MINING IS THE PROCESS OF RETRIEVAL OF IMPORTANT INFORMATION FROM UNSTRUCTURED
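A minimal sketch of tokenization for English text, assuming that full stops, commas, whitespace, and other punctuation all act as separators. The function name `tokenize` and the sample sentence are illustrative assumptions, and this simple pattern would not work for languages such as Chinese or Korean, where words are not separated by spaces.

```python
import re

def tokenize(text):
    """Split English text into lowercase word tokens.

    Anything that is not a letter, digit, or apostrophe is treated
    as a separator (full stops, commas, whitespace, and so on).
    """
    return re.findall(r"[a-z0-9']+", text.lower())

# Illustrative sample sentence (an assumption, not taken from a real corpus).
sample = "Text mining works on emails, HTML, and text documents."
print(tokenize(sample))
# ['text', 'mining', 'works', 'on', 'emails', 'html', 'and', 'text', 'documents']
```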
Stop Word Removal
The major aim of stop word removal is to reduce the dimensionality of the text
by removing certain prepositions, articles, and pronouns that are not necessary
for text mining. This reduces the text data significantly and helps in
optimizing the data. Lists of stop words are available online. Another way is
to build a stop word list based on the frequency of words across a number of
documents.
Methods of stop word removal are:
Term Based Random Sampling
Mutual Information Method
Based on Precompiled List
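Two of the ideas above can be sketched in a few lines: removal based on a precompiled list, and building a list from how often a word occurs across documents. This is only a sketch under stated assumptions: the stop word set is a tiny hand-written one, the names `remove_stop_words` and `frequent_words` and the 0.8 threshold are hypothetical, and the frequency approach shown is a plain document-frequency cutoff, not term based random sampling or the mutual information method.

```python
from collections import Counter

# Assumption: a tiny hand-written list; real precompiled lists are much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "in", "is", "it", "to", "on"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]

def frequent_words(documents, min_fraction=0.8):
    """Build a stop word list from words found in most documents.

    A word qualifies if it occurs in at least `min_fraction` of the
    tokenized documents (each document counts a word once).
    """
    doc_count = Counter()
    for tokens in documents:
        doc_count.update(set(tokens))
    threshold = min_fraction * len(documents)
    return {word for word, n in doc_count.items() if n >= threshold}

# Illustrative, already tokenized documents.
docs = [
    ["the", "hotel", "is", "clean"],
    ["the", "battery", "is", "weak"],
    ["the", "show", "is", "great"],
]
print(remove_stop_words(docs[0]))    # ['hotel', 'clean']
print(sorted(frequent_words(docs)))  # ['is', 'the']
```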
Applications of Text Mining
The main objective of text mining is to reduce the time spent and to filter
unnecessary data out from the main keywords or important data. It is used to
provide better services to users by giving proper feedback. It is also used by
businesses to analyze their consumer base and provide services accordingly by
targeting potential customers.
Because filtering based on IP address alone is not sufficient, certain text
mining techniques are used to detect salting. Salting is basically adding
certain information to a message to make it look like original or official
content. Email service providers use text mining to separate spam and
promotional messages from the rest of the important messages, thus saving
users time and resources. This can also be used to further filter messages
according to the suitable age group. It is used to provide protection against
phishing and spamming.
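As a rough sketch of filtering by text that is common in spam emails, the example below counts known spam phrases in a message. The phrase list, the names `spam_score` and `is_spam`, and the threshold of 2 are assumptions made for illustration; it does not attempt to detect salting, and real mail filters are considerably more elaborate.

```python
import re

# Assumption: an illustrative phrase list, not taken from any real filter.
SPAM_PHRASES = ["free", "winner", "click here", "urgent", "lottery"]

def spam_score(message):
    """Count how many of the known spam phrases occur in the message."""
    # Normalize case and collapse runs of whitespace before matching.
    text = re.sub(r"\s+", " ", message.lower())
    score = 0
    for phrase in SPAM_PHRASES:
        # Word boundaries keep "free" from matching inside "freedom".
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            score += 1
    return score

def is_spam(message, threshold=2):
    """Flag a message once it contains at least `threshold` spam phrases."""
    return spam_score(message) >= threshold

print(is_spam("URGENT: you are a lottery winner, click here"))   # True
print(is_spam("Meeting moved to 3 pm, see the free room list"))  # False
```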
Sentiment analysis is used to identify positive, negative, or neutral reviews
about a subject. Consider choosing a TV series to watch based on the reviews of
viewers. The text of each review is analyzed, and the keywords used reveal the
emotion of the reviewer, which can be used to mark the review of the show as
positive or negative. It also focuses on individual words and phrases to
identify how negative or positive they are.
Statement: “I LOVED THE NEW MOBILE. BUT IT IS VERY EXPENSIVE AND DOES NOT HAVE
GREAT BATTERY LIFE.”
According to the first sentence the customer seems impressed, but overall the
customer has a negative impression of the product.
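The reasoning about this statement can be sketched with a small word lexicon: each sentence is scored by its keywords, and a negation such as "not" flips the polarity of a keyword that follows shortly after it. The lexicon, the negation window of 3 tokens, and the function names are assumptions chosen for this example, not an established sentiment model.

```python
import re

# Assumption: a toy lexicon for illustration only.
LEXICON = {"loved": 1, "great": 1, "good": 1,
           "expensive": -1, "rotten": -1, "bad": -1}
NEGATIONS = {"not", "no", "never"}

def sentence_score(sentence, window=3):
    """Sum keyword polarities, flipping those shortly after a negation."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = 0
    negate_left = 0  # number of upcoming tokens still under a negation
    for tok in tokens:
        if tok in NEGATIONS:
            negate_left = window
            continue
        polarity = LEXICON.get(tok, 0)
        if polarity and negate_left > 0:
            polarity = -polarity
        score += polarity
        if negate_left > 0:
            negate_left -= 1
    return score

def review_scores(review):
    """Score each sentence of a review separately."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    return [sentence_score(s) for s in sentences]

review = ("I LOVED THE NEW MOBILE. BUT IT IS VERY EXPENSIVE "
          "AND DOES NOT HAVE GREAT BATTERY LIFE")
scores = review_scores(review)
print(scores, sum(scores))  # [1, -2] -1
```

The first sentence scores positive, the second scores negative twice ("expensive" and the negated "great"), so the overall total is negative, which matches the reading given above.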
Certain words also give an indication about products. For example, if you come
across the word ROTTEN while reading reviews about a hotel, it creates a
negative impression of the hotel.
Year by year, the number of research studies in medical fields is increasing
significantly, so the need for text mining is evident. Text mining is used for
quickly sorting out the necessary data from the medical records that are
available. In fields like cancer treatment, text mining means improving the
diagnosis, treatment, and prevention of cancer by mining databases.
Another use of text mining is mining EHRs (Electronic Health Records) to search
a patient's previous records of certain diseases and medical history.
Text mining is used for comparing gene markers with previous ones and for
identifying different patterns in genes to check for diseases.
Social media are a rich source of unstructured data. Social media is used for
connecting people, i.e. for interactions and conversations. Some well-known
platforms are Twitter, Facebook, and Orkut. Data can be gathered using APIs.