What is Data Mining?
Data mining is a computer-based technology that uses sophisticated algorithms to find relationships in large databases. The rapid growth of data provides the motivation to find meaningful patterns within these huge data sets.
Data Mining Concepts:
Data mining is the process of discovering actionable information from large sets of data. Data mining uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data.
It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the “knowledge discovery in databases (KDD)” process.
Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).
Data Mining Process
Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.
Data mining doesn’t eliminate the need to know your business, to understand your data, or to understand analytical methods. Data mining assists business analysts with finding patterns and relationships in the data — it does not tell you the value of the patterns to the organization. Furthermore, the patterns uncovered by data mining must be verified in the real world.
Data Mining Software and Algorithms:
Data Mining Software:
• Weka – open-source software for data mining
• RapidMiner – an open-source system for data and text mining
• KNIME – an open-source data integration, processing, analysis, and exploration platform
• Apache Mahout – a machine learning library for mining large data sets; it supports recommendation mining, clustering, classification, and frequent itemset mining.
• Rattle – a GUI for data mining using R
• R-Programming – The R Project for Statistical Computing
• Orange – Interactive data analysis workflows with a large toolbox.
• NLTK – Python programs to work with human language data
• Scikit-learn – Machine Learning in Python
• MATLAB – Analyze and design the systems and products transforming our world
• Libsvm – A Library for Support Vector Machines
Data Mining Algorithms:
1. C4.5
C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified. Systems that construct classifiers are among the most commonly used tools in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by its values for a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs.
A popular open-source Java implementation can be found at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in its decision tree classifier.
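To make the split criterion concrete, here is a minimal pure-Python sketch of the information gain ratio that C4.5 uses to choose which attribute to split on; the toy weather-style cases and labels are assumptions made up for the example.

import math
from collections import Counter

# Hypothetical cases: (outlook, plays_tennis) pairs, made up for illustration.
cases = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
    ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
    ("rain", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rain", "no"),
]

def entropy(labels):
    """Shannon entropy of a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

labels = [label for _, label in cases]
base_entropy = entropy(labels)

# Split the cases by the attribute value and measure the weighted entropy.
by_value = {}
for value, label in cases:
    by_value.setdefault(value, []).append(label)
split_entropy = sum(len(ls) / len(cases) * entropy(ls) for ls in by_value.values())
gain = base_entropy - split_entropy

# C4.5 normalizes the gain by the split information, which penalizes
# attributes with many distinct values.
split_info = entropy([value for value, _ in cases])
gain_ratio = gain / split_info
print(f"gain = {gain:.3f}, gain ratio = {gain_ratio:.3f}")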
2. k-means
The k-means algorithm is a simple iterative method that partitions a given dataset into a user-specified number of clusters, k. It creates k groups from a set of objects so that the members of a group are more similar to one another than to members of other groups. It is a popular cluster analysis technique for exploring a dataset.
k-means clustering is implemented in data mining tools such as Apache Mahout, Julia, R, SciPy, Weka, MATLAB, and SAS.
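As a concrete illustration, here is a minimal sketch of k-means using scikit-learn, one of the tools listed above; the toy 2-D points and the choice of k = 2 are assumptions made for the example.

import numpy as np
from sklearn.cluster import KMeans

# Six hypothetical 2-D points; two loose groups by construction.
points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                   [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Partition the points into k = 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels: ", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)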
3. Support vector machines (SVM)
A support vector machine (SVM) learns a hyperplane that classifies data into two classes. At a high level, SVM performs a task similar to C4.5, except that SVM doesn't use decision trees at all. In today's machine learning applications, support vector machines are considered a must-try: they offer one of the most robust and accurate methods among all well-known algorithms. SVM has a sound theoretical foundation, requires only a dozen examples for training, and is insensitive to the number of dimensions. In addition, efficient methods for training SVMs are being developed at a fast pace.
It is available in scikit-learn, MATLAB, and, of course, the libsvm data mining tool.
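Here is a minimal sketch of a two-class SVM using scikit-learn's SVC (which wraps libsvm); the toy points, labels, and linear kernel are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

# Four hypothetical labeled points in two classes.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

# Learn a separating hyperplane with a linear kernel.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[1.5, 1.2], [2.8, 3.1]]))  # classify two new points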
4. Apriori
One of the most popular data mining approaches is to find frequent itemsets from a transaction dataset and derive association rules. Finding frequent itemsets (itemsets with frequency larger than or equal to a user-specified minimum support) is not trivial because of its combinatorial explosion. Once frequent itemsets are obtained, it is straightforward to generate association rules with confidence larger than or equal to a user-specified minimum confidence. Apriori is a seminal algorithm for finding frequent itemsets using candidate generation. The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.
The Apriori algorithm is available in tools such as ARtool, Weka, and Orange.
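To show the candidate-generation idea, here is a minimal pure-Python sketch that finds frequent itemsets in a handful of hypothetical transactions; the transactions and the minimum support threshold are made up for illustration.

# Minimal Apriori-style sketch: find frequent itemsets in toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support = 3  # an itemset is "frequent" if it appears in at least 3 transactions

def support(itemset):
    """Count the transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Level k: candidates are unions of frequent (k-1)-itemsets; prune by support.
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent:
    for itemset in sorted(level, key=sorted):
        print(set(itemset), "support =", support(itemset))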
5. Expectation-Maximization (EM)
Expectation-Maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery. In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables.
The EM algorithm is available in Weka and is also implemented in packages and modules for R and scikit-learn.
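As an illustration, here is a minimal sketch that fits a Gaussian mixture model with scikit-learn, whose parameters are estimated with the EM algorithm; the two synthetic clusters are an assumption made for the example.

import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic 2-D clusters, centered roughly at (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# GaussianMixture runs EM to estimate the means, covariances, and weights.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("estimated means:\n", gmm.means_)
print("cluster of a new point:", gmm.predict([[4.8, 5.2]]))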
6. PageRank
PageRank is a search ranking algorithm that uses hyperlinks on the Web. PageRank produces a static ranking of Web pages in the sense that a PageRank value is computed for each page offline and does not depend on search queries. It is a type of network analysis that explores the associations among objects.
Below are some implementations of the PageRank algorithm; a small power-iteration sketch follows the list.
- C++ OpenSource PageRank Implementation
- Python PageRank Implementation
- igraph – The network analysis package (R)
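The sketch below is a minimal power-iteration version of PageRank on a tiny hypothetical link graph; the graph, the damping factor of 0.85, and the iteration count are illustrative assumptions.

# Tiny hypothetical link graph: page -> pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
damping, iterations = 0.85, 50
pages = list(graph)
rank = {p: 1.0 / len(pages) for p in pages}  # start with a uniform ranking

for _ in range(iterations):
    new_rank = {}
    for p in pages:
        # Sum the rank flowing into p from every page that links to it.
        incoming = sum(rank[q] / len(graph[q]) for q in pages if p in graph[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in sorted(rank.items())})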
7. AdaBoost
AdaBoost is a boosting algorithm that constructs a classifier, that is, a model that takes data and predicts the class to which a new data element belongs. Boosting is an ensemble learning approach that takes multiple learning algorithms (e.g., decision trees) and combines them. The goal is to take an ensemble, or group, of weak learners and combine them into a single strong learner.
AdaBoost has many implementations and variants. Here are a few:
scikit-learn, ICSIBoost, and gbm (Generalized Boosted Regression Models).
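Here is a minimal sketch of boosting with scikit-learn's AdaBoostClassifier, which combines many shallow decision trees (weak learners) into a single strong classifier; the synthetic dataset is an assumption made for the example.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class dataset, generated only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Combine 50 weak learners (shallow decision trees) into one strong classifier.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))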
8. kNN
kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers described so far, such as AdaBoost, because it is a lazy learner. A lazy learner doesn't do much during the training process other than store the training data. Only when new, unlabeled data is presented does this type of learner set out to classify it.
The kNN algorithm is implemented in MATLAB (k-nearest neighbor classification), scikit-learn (KNeighborsClassifier), and R (k-nearest neighbour classification).
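The following minimal sketch shows the lazy-learner behaviour with scikit-learn's KNeighborsClassifier: fitting only stores the labeled points, and the neighbors are consulted at prediction time. The toy 1-D data is an illustrative assumption.

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D training data: small values are class 0, large values class 1.
X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # "training" just stores the data
print(knn.predict([[1.5], [8.5]]))                   # neighbors are found at query time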
9. Naive Bayes
Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption: Every feature of the data being classified is independent of all other features given the class. Two features are independent when the value of one feature has no effect on the value of another feature.
The Naive Bayes algorithm is implemented in Orange, scikit-learn, Weka, and R.
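Here is a minimal sketch using scikit-learn's GaussianNB, which applies the conditional-independence assumption described above; the toy features and labels are made up for illustration.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-feature examples with two classes.
X = np.array([[1.0, 20.0], [1.2, 22.0], [3.0, 5.0], [3.2, 4.0]])
y = np.array(["spam", "spam", "ham", "ham"])

# Each feature is modeled independently given the class.
nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 21.0], [3.1, 4.5]]))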
10. CART
CART stands for classification and regression trees. It is a decision tree learning technique that outputs either classification or regression trees. Like C4.5, CART is a classifier. A classification tree is a type of decision tree. The output of a classification tree is a class.
scikit-learn implements CART in their decision tree classifier. R’s tree package has an implementation of CART. Weka and MATLAB also have implementations.
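Below is a minimal sketch of both outputs CART can produce, using scikit-learn's decision tree classifier and regressor; the tiny datasets are illustrative assumptions.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the output is a class label.
Xc, yc = [[0], [1], [2], [3]], ["no", "no", "yes", "yes"]
clf = DecisionTreeClassifier(max_depth=2).fit(Xc, yc)
print(clf.predict([[1.5]]))

# Regression tree: the output is a numeric value.
Xr, yr = [[0], [1], [2], [3]], [0.0, 0.9, 2.1, 3.0]
reg = DecisionTreeRegressor(max_depth=2).fit(Xr, yr)
print(reg.predict([[1.5]]))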
There are many more tools and algorithms for data mining; this has been a high-level overview of the top tools and algorithms currently used in the market.