
### It is Not the Machine that is learning. Are human algorithms forcing everyone to adapt or die?

Posted by: adonis49 on: November 16, 2020

# Which machine learning algorithm should I use? How many and which one is best?

**Note:** in the early 1990s, I took graduate classes in **Artificial Intelligence (AI)** (the if…then series of questions and answers of experts in their fields of work) and in **neural networks** developed by psychologists.

The concepts are the same, though upgraded with new algorithms and automation.

I recall a book with a **Table (like the Mendeleev table in chemistry) that contained the terms, mental processes, and mathematical concepts behind the ideas** that formed the AI trend…

There are several lists of methods, depending on the field of study you are more concerned with.

One **list** consists of methods that human factors professionals are trained to use when needed, such as:

Verbal protocol, neural network, utility theory, preference judgments, psycho-physical methods, operational research, prototyping, information theory, cost/benefit methods, various statistical modeling packages, and expert systems.

There are those that are intrinsic to **artificial intelligence methodology** such as:

**Fuzzy logic, robotics, discrimination nets, pattern matching, knowledge representation, frames, schemata, semantic network, relational databases, searching methods, zero-sum games theory, logical reasoning methods, probabilistic reasoning, learning methods, natural language understanding, image formation and acquisition, connectedness, cellular logic, problem solving techniques, means-end analysis, geometric reasoning system, algebraic reasoning system.**

Hui Li posted this on Subconscious Musings on April 12, 2017 (Advanced Analytics | Machine Learning):

This resource is designed primarily for beginner to intermediate data scientists or analysts who are interested in identifying and applying machine learning algorithms to address the problems of their interest.

A **typical question asked by a beginner**, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?”

The answer to the question varies depending on many factors, including:

- The size, quality, and nature of data.
- The available computational time.
- The urgency of the task.
- What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms.

We are not advocating a one and done approach, but we do hope to provide some guidance on which algorithms to try first depending on some clear factors.

## The machine learning algorithm cheat sheet

The **machine learning algorithm cheat sheet** helps you to choose from a variety of machine learning algorithms to find the appropriate algorithm for your specific problems.

This article walks you through the process of how to use the sheet.

Since the cheat sheet is designed for beginner data scientists and analysts, we will make some simplified assumptions when talking about the algorithms.

**The algorithms recommended here result from compiled feedback and tips from several data scientists and machine learning experts and developers.**

There are several issues on which we have not reached agreement, and for these we try to highlight the commonalities and reconcile the differences.

Additional algorithms will be added later as our library grows to encompass a more complete set of available methods.

**How to use the cheat sheet**

Read the path and algorithm labels on the chart as “If *<path label>* then use *<algorithm>*.” For example:

- If you want to perform dimension reduction then use principal component analysis.
- If you need a numeric prediction quickly, use decision trees or linear regression.
- If you need a hierarchical result, use hierarchical clustering.

Sometimes more than one branch will apply, and other times none of them will be a perfect match.

It’s important to remember these paths are intended to be **rule-of-thumb** recommendations, so some of the recommendations are not exact.

Several data scientists I talked with said that the only sure way to find the very best algorithm is to **try all of them.**

**(Is that a process to find an algorithm that matches your world view on an issue? Or an answer that satisfies your boss?)**

**Types of machine learning algorithms**

This section provides an overview of the most popular types of machine learning. If you’re familiar with these categories and want to move on to discussing specific algorithms, you can skip this section and go to “When to use specific algorithms” below.

**Supervised learning**

Supervised learning algorithms make predictions based on a **set of examples.**

For example, historical sales can be used to estimate future prices. With supervised learning, you have an input variable that consists of labeled training data and a desired output variable.

You use an algorithm to analyze the training data to learn the function that maps the input to the output. This inferred function maps new, unknown examples by generalizing from the training data to anticipate results in unseen situations.

- **Classification:** When the data are being used to predict a categorical variable, supervised learning is also called classification. This is the case when assigning a label or indicator, such as “dog” or “cat,” to an image. When there are only two labels, this is called binary classification. When there are more than two categories, the problem is called multi-class classification.
- **Regression:** When predicting **continuous values**, the problem becomes a regression problem.
- **Forecasting:** This is the process of making predictions about the future based on past and present data. It is most commonly used to **analyze trends**. A common example is estimating next year’s sales based on the sales of the current and previous years.

**Semi-supervised learning**

The challenge with supervised learning is that labeling data can be expensive and time consuming. If labels are limited, you can use unlabeled examples to enhance supervised learning. Because the machine is not fully supervised in this case, we say the machine is semi-supervised. With semi-supervised learning, you use unlabeled examples with a small amount of labeled data to improve the learning accuracy.
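One common semi-supervised scheme is **self-training**: fit on the labeled data, pseudo-label the unlabeled examples the model is most confident about, and refit. A toy numpy sketch using a nearest-centroid classifier (the data, helper names, and the "closer half is confident" rule are all invented for illustration):

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # One centroid per class: the mean of that class's labeled points.
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(X, classes, centroids):
    # Distance from every point to every class centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)], d.min(axis=1)

rng = np.random.default_rng(0)
# Two well-separated blobs; only 2 points per class carry labels.
X0 = rng.normal(loc=-3.0, scale=0.5, size=(50, 2))
X1 = rng.normal(loc=3.0, scale=0.5, size=(50, 2))
X_lab = np.vstack([X0[:2], X1[:2]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.vstack([X0[2:], X1[2:]])

# Self-training loop: pseudo-label the unlabeled points the model is most
# confident about, fold them into the labeled pool, and refit.
for _ in range(5):
    classes, cents = nearest_centroid_fit(X_lab, y_lab)
    pseudo, dist = nearest_centroid_predict(X_unlab, classes, cents)
    confident = dist < np.median(dist)   # keep the closer (more confident) half
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, pseudo[confident]])
    X_unlab = X_unlab[~confident]
    if len(X_unlab) == 0:
        break

print(len(y_lab))   # the labeled pool has grown well beyond the original 4 points
```

The small labeled set only anchors the two centroids; the unlabeled points then pull them toward the true cluster means.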

**Unsupervised learning**

When performing unsupervised learning, the machine is presented with totally unlabeled data. It is asked to discover the intrinsic patterns that underlie the data, such as a clustering structure, a low-dimensional manifold, or a sparse tree or graph.

- **Clustering:** Grouping a set of data examples so that examples in one group (or one cluster) are more similar (according to some criteria) than those in other groups. This is often used to segment the whole dataset into several groups. Analysis can be performed within each group to help users find intrinsic patterns.
- **Dimension reduction:** Reducing the number of variables under consideration. In many applications, the raw data have very high-dimensional features, and some features are redundant or irrelevant to the task. Reducing the dimensionality helps to find the true, latent relationships.

**Reinforcement learning**

Reinforcement learning analyzes and optimizes the behavior of an agent based on feedback from the environment. Machines try different scenarios to discover which actions yield the greatest reward, rather than being told which actions to take. Trial and error and delayed rewards distinguish reinforcement learning from other techniques.
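The simplest setting showing this trial-and-error loop is the multi-armed bandit. A numpy sketch with an ε-greedy agent (the arm payoffs, exploration rate, and horizon are arbitrary illustrative choices):

```python
import numpy as np

# A 3-armed bandit: the agent is never told which arm pays best; it must
# discover that by trial and error, guided only by the rewards it receives.
rng = np.random.default_rng(42)
true_means = np.array([0.2, 0.5, 0.8])   # arm 2 pays best, unknown to the agent

Q = np.zeros(3)       # estimated value of each arm
counts = np.zeros(3)
epsilon = 0.1         # exploration rate

for t in range(2000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))       # explore: try a random arm
    else:
        arm = int(Q.argmax())            # exploit: current best estimate
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    Q[arm] += (reward - Q[arm]) / counts[arm]   # incremental mean update

print(int(Q.argmax()))  # → 2
```

The occasional random "explore" step is what lets the agent escape an early, wrong belief about which arm is best.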

**Considerations when choosing an algorithm**

When choosing an algorithm, always take these aspects into account: accuracy, training time and ease of use. Many users put the accuracy first, while beginners tend to focus on algorithms they know best.

When presented with a dataset, the first thing to consider is how to obtain results, no matter what those results might look like. Beginners tend to choose algorithms that are easy to implement and can obtain results quickly. This works fine, as long as it is just the first step in the process. Once you obtain some results and become familiar with the data, you may spend more time using more sophisticated algorithms to strengthen your understanding of the data, hence further improving the results.

Even in this stage, the best algorithms might not be the methods that have achieved the highest reported accuracy, as an algorithm usually requires careful tuning and extensive training to obtain its best achievable performance.

**When to use specific algorithms**

Looking more closely at individual algorithms can help you understand what they provide and how they are used. These descriptions provide more details and give additional tips for when to use specific algorithms, in alignment with the cheat sheet.

**Linear regression and Logistic regression**


Linear regression is an approach for modeling the relationship between a continuous dependent variable $y$ and one or more predictors $X$. The relationship between $y$ and $X$ can be linearly modeled as $y = \beta^T X + \epsilon$. Given the training examples $\{x_i, y_i\}_{i=1}^N$, the parameter vector $\beta$ can be learned.

If the dependent variable is not continuous but categorical, linear regression can be transformed to logistic regression using a logit link function. Logistic regression is a simple, fast yet powerful classification algorithm.

Here we discuss the binary case, where the dependent variable $y$ only takes binary values $\{y_i \in \{-1, 1\}\}_{i=1}^N$ (this can be easily extended to multi-class classification problems).

In logistic regression we use a different hypothesis class to try to predict the probability that a given example belongs to the “1” class versus the probability that it belongs to the “-1” class. Specifically, we try to learn a function of the form $p(y_i = 1 \mid x_i) = \sigma(\beta^T x_i)$ and $p(y_i = -1 \mid x_i) = 1 - \sigma(\beta^T x_i)$.

Here $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function. Given the training examples $\{x_i, y_i\}_{i=1}^N$, the parameter vector $\beta$ can be learned by maximizing the log-likelihood of $\beta$ given the data set.
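As a toy illustration of this maximum-likelihood fit, here is plain gradient ascent in numpy on synthetic data with labels in {-1, +1} (the data, learning rate, and iteration count are arbitrary assumptions, not SAS's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Toy binary data with labels in {-1, +1}, as in the text.
N = 200
X = rng.normal(size=(N, 2))
beta_true = np.array([2.0, -1.0])
y = np.where(sigmoid(X @ beta_true) > rng.random(N), 1, -1)

# Maximize the log-likelihood  sum_i log sigmoid(y_i * beta^T x_i)
# by plain gradient ascent.
beta = np.zeros(2)
lr = 0.1
for _ in range(500):
    margins = y * (X @ beta)
    grad = (X * (y * sigmoid(-margins))[:, None]).sum(axis=0)
    beta += lr * grad / N

p = sigmoid(X @ beta)              # p(y = 1 | x)
pred = np.where(p > 0.5, 1, -1)
print((pred == y).mean())          # training accuracy, well above chance
```

The learned $\beta$ recovers the sign pattern of the true coefficients; accuracy is limited only by the noise in how the labels were sampled.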

**Linear SVM and kernel SVM**

A support vector machine (SVM) training algorithm finds the classifier represented by the normal vector $w$ and bias $b$ of the hyperplane. This hyperplane (boundary) separates different classes by as wide a margin as possible. The problem can be converted into a constrained optimization problem:

$\min_{w} \|w\| \quad \text{subject to} \quad y_i (w^T X_i - b) \geq 1, \; i = 1, \ldots, n.$


When the classes are not linearly separable, a kernel trick can be used to map a non-linearly separable space into a higher dimension linearly separable space.

When most of the predictor variables are numeric, logistic regression and SVM should be the first try for classification. These models are easy to implement, their parameters are easy to tune, and their performance is also pretty good. So these models are appropriate for beginners.
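In practice the margin constraint is often optimized through its hinge-loss relaxation. A subgradient-descent sketch of a linear SVM on toy separable data (the regularization constant, step size, and iteration count are arbitrary assumptions; a kernel SVM would replace the dot products with kernel evaluations):

```python
import numpy as np

rng = np.random.default_rng(1)
# Linearly separable toy data with labels in {-1, +1}.
N = 100
X = rng.normal(size=(N, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
X = X + 0.5 * y[:, None]            # push the two classes apart

# Minimize  lam/2 * ||w||^2 + mean hinge loss  by subgradient descent;
# the hinge loss penalizes points violating the margin y_i (w^T x_i - b) >= 1.
w = np.zeros(2)
b = 0.0
lam = 0.01
lr = 0.1
for _ in range(2000):
    margins = y * (X @ w - b)
    viol = margins < 1                                   # margin violators
    grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / N
    grad_b = y[viol].sum() / N
    w -= lr * grad_w
    b -= lr * grad_b

acc = (np.sign(X @ w - b) == y).mean()
print(acc)   # training accuracy on this separable toy set
```

Only the margin violators contribute to the gradient, which is exactly why the solution depends on a few "support vectors" near the boundary.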

**Trees and ensemble trees**

**Decision trees**, random forest and gradient boosting are all algorithms based on decision trees.

There are many variants of decision trees, but they all do the same thing – subdivide the feature space into regions with mostly the same label. Decision trees are easy to understand and implement.

However, they tend to **overfit the data** when we exhaust the branches and go very deep with the trees. **Random forests** and gradient boosting are two popular ways to use tree algorithms to achieve good accuracy and to overcome the over-fitting problem.
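One way to see why averaging many trees helps: bootstrap-aggregate ("bag") the simplest possible trees, one-split stumps, and predict by majority vote. A toy numpy sketch (not a full random forest, which would also subsample features at each split; all data here is invented):

```python
import numpy as np

rng = np.random.default_rng(7)

def fit_stump(X, y):
    # Exhaustively pick the (feature, threshold, sign) split with fewest errors.
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sgn in (1, -1):
                err = (np.where(X[:, j] > t, sgn, -sgn) != y).mean()
                if err < best[3]:
                    best = (j, t, sgn, err)
    return best[:3]

def stump_predict(X, stump):
    j, t, sgn = stump
    return np.where(X[:, j] > t, sgn, -sgn)

# 2-D data whose label depends on BOTH features; one stump can't capture that.
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Bagging: each stump sees a bootstrap resample; the ensemble majority-votes.
stumps = []
for _ in range(51):
    idx = rng.integers(0, len(X), len(X))
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.mean([stump_predict(X, s) for s in stumps], axis=0)
ensemble_acc = (np.sign(votes) == y).mean()
single_acc = (stump_predict(X, fit_stump(X, y)) == y).mean()
print(single_acc, ensemble_acc)
```

Each bootstrap sample yields a slightly different split, so the vote smooths out the individual stumps' axis-aligned mistakes.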

**Neural networks and deep learning**

Neural networks flourished in the mid-1980s due to their **parallel and distributed processing ability**.

Research in this field was impeded by the ineffectiveness of the **back-propagation training algorithm** that is widely used to optimize the parameters of neural networks. Support vector machines (SVM) and other simpler models, which can be easily trained by solving **convex optimization problems,** gradually replaced neural networks in machine learning.

In recent years, new and improved training techniques such as unsupervised pre-training and layer-wise greedy training have led to a resurgence of interest in neural networks.

Increasingly powerful computational capabilities, such as graphical processing unit (GPU) and massively parallel processing (MPP), have also spurred the revived adoption of neural networks. The resurgent research in neural networks has given rise to the invention of models with thousands of layers.

Shallow neural networks have evolved into deep learning neural networks.

Deep neural networks have been very successful for supervised learning. When used for speech and image recognition, deep learning performs as well as, or even better than, humans.

Applied to unsupervised learning tasks, such as feature extraction, deep learning also extracts features from raw images or speech with much less human intervention.

**A neural network consists of three parts: an input layer, hidden layers and an output layer.**

The training samples define the input and output layers. When the output layer is a categorical variable, then the neural network is a way to address classification problems. When the output layer is a continuous variable, then the network can be used to do regression.

When the output layer is the same as the input layer, the network can be used to extract intrinsic features.

The number of hidden layers defines the model complexity and modeling capacity.
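A minimal numpy sketch of such a network, with one hidden layer trained by back-propagation on the XOR problem (the layer size, learning rate, and epoch count are illustrative choices; XOR is the classic task a network with no hidden layer cannot solve):

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: output is 1 exactly when the two inputs differ.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 8 units; sigmoid activations throughout.
W1 = rng.normal(scale=1.0, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=1.0, size=(8, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
lr = 1.0
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(((out - y) ** 2).mean()))
    # Backward pass: back-propagate the squared-error gradient.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(losses[0], losses[-1])   # the loss drops as training proceeds
```

With no hidden layer this loss would plateau; the hidden units let the network carve the input space into the two XOR regions.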


**k-means/k-modes, GMM (Gaussian mixture model) clustering**


K-means/k-modes and GMM clustering aim to partition n observations into k clusters. K-means defines a hard assignment: each sample is associated with one and only one cluster. GMM, however, defines a soft assignment: each sample has a probability of being associated with each cluster. Both algorithms are simple and fast enough for clustering when the number of clusters k is given.
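A toy sketch contrasting the two assignment styles. For brevity the "soft" step reuses the k-means centroids with a fixed unit variance instead of running a full EM fit, purely to illustrate per-cluster probabilities (data and seeding are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# Two clearly separated blobs in the plane.
X = np.vstack([rng.normal(-2.0, 0.5, size=(60, 2)),
               rng.normal(2.0, 0.5, size=(60, 2))])

# --- k-means: hard assignment, each point belongs to exactly one cluster ---
centroids = X[[0, 60]].copy()          # seed with one point from each region
for _ in range(20):
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    hard = d.argmin(axis=1)            # a single cluster index per point
    centroids = np.array([X[hard == k].mean(axis=0) for k in range(2)])

# --- soft assignment in the spirit of a GMM ---
# Each point gets a probability per cluster (unit-variance Gaussians at the
# k-means centroids, instead of a full EM fit).
logp = -0.5 * np.linalg.norm(X[:, None] - centroids[None], axis=2) ** 2
soft = np.exp(logp)
soft /= soft.sum(axis=1, keepdims=True)

print(hard[:3], soft[0].round(3))
```

Points deep inside a blob get probabilities near 1 for their cluster; points near the boundary would split their probability mass, which is exactly the information a hard assignment throws away.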

**DBSCAN**

When the number of clusters k is not given, DBSCAN (density-based spatial clustering of applications with noise) can be used, connecting samples through density diffusion.

**Hierarchical clustering**

Hierarchical partitions can be visualized using a tree structure (a dendrogram). Hierarchical clustering does not need the number of clusters as an input, and the partitions can be viewed at different levels of granularity (i.e., clusters can be refined or coarsened) by cutting the tree at different heights.

**PCA, SVD and LDA**

We generally do not want to feed a large number of features directly into a machine learning algorithm, since some features may be irrelevant or the “intrinsic” dimensionality may be smaller than the number of features. Principal component analysis (PCA), singular value decomposition (SVD), and latent Dirichlet allocation (LDA) can all be used to perform dimension reduction.

PCA is an unsupervised dimension-reduction method which maps the original data space into a lower-dimensional space while preserving as much information as possible. PCA basically finds a subspace that best preserves the data variance, with the subspace defined by the dominant eigenvectors of the data’s covariance matrix.

The SVD is related to PCA in the sense that SVD of the centered data matrix (features versus samples) provides the dominant left singular vectors that define the same subspace as found by PCA. However, SVD is a more versatile technique as it can also do things that PCA may not do.
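This PCA–SVD relationship can be checked numerically in a few lines (note the orientation: for a samples-by-features matrix, the principal directions are the right singular vectors; the synthetic data below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
# 100 samples, 3 features, with most variance along the first feature.
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)                 # center the data, as PCA requires

# PCA route: dominant eigenvector of the covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]                    # eigh sorts eigenvalues ascending

# SVD route: first right singular vector of the centered data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
sv1 = Vt[0]

# Same direction, up to an arbitrary sign flip.
print(abs(pc1 @ sv1))   # ≈ 1.0 within floating-point tolerance
```

The two routes agree because the covariance matrix factors as $\frac{1}{N-1} V S^2 V^T$, so its eigenvectors are exactly the right singular vectors of the centered data.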

For example, the SVD of a user-versus-movie matrix is able to extract the user profiles and movie profiles which can be used in a recommendation system. In addition, SVD is also widely used as a topic modeling tool, known as latent semantic analysis, in natural language processing (NLP).

A related technique in NLP is latent Dirichlet allocation (LDA). LDA is a probabilistic topic model: it decomposes documents into topics in a similar way as a Gaussian mixture model (GMM) decomposes continuous data into Gaussian densities. Unlike the GMM, LDA models discrete data (words in documents), and it constrains the topics to be *a priori* distributed according to a Dirichlet distribution.

**Conclusions**

This workflow is easy to follow. The takeaway messages when trying to solve a new problem are:

- Define the problem. What problems do you want to solve?
- Start simple. Be familiar with the data and the baseline results.
- Then try something more complicated.

**Dr. Hui Li is a Principal Staff Scientist of Data Science Technologies at SAS.** Her current work focuses on Deep Learning, Cognitive Computing and SAS recommendation systems in SAS Viya. She received her PhD degree and Master’s degree in Electrical and Computer Engineering from Duke University. Before joining SAS, she worked at Duke University as a research scientist and at Signal Innovation Group, Inc. as a research engineer. Her research interests include machine learning for big, heterogeneous data, collaborative filtering recommendations, Bayesian statistical modeling and reinforcement learning.

**What did the “Islamic Empire” do a thousand years ago?**

I say Islamic Empire because it is Not the “Arabs” who came from the Peninsula who brought civilization and culture to the vast empire: it was the Syrians, Iraqis, Iranians and Egyptians, with their education, knowledge, languages and sciences, who translated all previous knowledge into Arabic and added immensely to human knowledge.

“A thousand years is a long time; the first book published in French wasn’t until 1476.

Goodness knows what an Islamic caliphate would have been doing 1,000 years ago? They bought rare books in various languages, paying with gold and swapping prisoners for them.

They built the House of Wisdom in Baghdad, one of the first universities in the world;

they asked scholars of all faiths to translate every text ever written into Arabic;

they demanded the first qualifications for doctors,

founded the **first psychiatric hospitals and invented ophthalmology.**

They developed **algebra** (**algorithms** are named after their Arab father) and a programmable machine … a computer.

They **introduced Aristotle to Europe,**

**Al-Jahiz** began **theories of natural selection,**

they discovered the **Andromeda galaxy,**

**Classified the spinal nerve**s and

Created **hydropower using pumps and gears.**

The Wahhabi terrorists of Saudi Kingdom in ISIS and Al Nusra want to destroy the knowledge that Islam is a beautiful, scientific and intelligent culture, and we are way ahead of them.”

**Hacking OkCupid: Chris McKinlay finding “True Love”**

What do large-scale data processing and **parallel numerical methods have to do with falling in love?**

OkCupid was founded by Harvard math majors in 2004, and it first caught daters’ attention because of its computational approach to matchmaking.

Members answer droves of multiple-choice survey questions on everything from politics, religion, and family to love, sex, and smartphones.

OkCupid lets users see the responses of others, but only to questions they’ve answered themselves.

KEVIN POULSEN posted this Jan. 21, 2014

# How a Math Genius Hacked OkCupid to Find True Love

Chris McKinlay was folded into a cramped fifth-floor cubicle in UCLA’s math sciences building, lit by a single bulb and the glow from his monitor.

It was 3 am, the optimal time to squeeze cycles out of the supercomputer in Colorado that he was using for his PhD dissertation.

(The subject: large-scale data processing and **parallel numerical methods**.)

While the computer chugged, he clicked open a second window to **check his OkCupid inbox**.

Mathematician **Chris McKinlay** hacked OKCupid to find the girl of his dreams

McKinlay, a lanky 35-year-old with tousled hair, was one of about 40 million Americans looking for romance through websites like **Match.com, J-Date, and e-Harmony**, and he’d been searching in vain since his last breakup 9 months earlier.

He’d sent dozens of **cutesy** introductory messages to women touted as potential matches by **OkCupid’s algorithms**. Most were ignored; **he’d gone on a total of 6 first dates.**

On that early morning in June 2012, his compiler crunching out machine code in one window, his forlorn dating profile sitting idle in the other, **it dawned on him that he was doing it wrong**.

He’d been approaching online matchmaking like any other user. Instead, he realized, he should be **dating like a mathematician**.

**On average, respondents select 350 questions from a pool of thousands—“Which of the following is most likely to draw you to a movie?” or “How important is religion/God in your life?”**

For each, the user records an answer, specifies which responses they’d find acceptable in a mate, and rates how important the question is to them on a **5-point scale from “irrelevant” to “mandatory.”** OkCupid’s matching engine uses that data to calculate a couple’s compatibility. The closer to 100 percent—mathematical soul mate—the better.
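OkCupid has publicly described the match percentage as, roughly, the geometric mean of the two users' one-way satisfaction scores. A toy sketch of that structure (the importance point values, question set, and helper names here are illustrative assumptions, not OkCupid's actual implementation):

```python
import math

# Importance weights are illustrative assumptions, not OkCupid's exact values.
WEIGHT = {"irrelevant": 0, "a little": 1, "somewhat": 10, "very": 50, "mandatory": 250}

def satisfaction(answers_a, accepts_b, importance_b):
    """How well A's answers satisfy B: importance points earned / points possible."""
    earned = possible = 0
    for q in importance_b:
        w = WEIGHT[importance_b[q]]
        possible += w
        if answers_a.get(q) in accepts_b[q]:
            earned += w
    return earned / possible if possible else 0.0

def match_percent(a, b):
    # Geometric mean of the two one-way satisfaction scores.
    s_ab = satisfaction(a["answers"], b["accepts"], b["importance"])
    s_ba = satisfaction(b["answers"], a["accepts"], a["importance"])
    return math.sqrt(s_ab * s_ba) * 100

alice = {"answers": {"q1": "yes", "q2": "no"},
         "accepts": {"q1": {"yes"}, "q2": {"no", "maybe"}},
         "importance": {"q1": "very", "q2": "somewhat"}}
bob = {"answers": {"q1": "yes", "q2": "yes"},
       "accepts": {"q1": {"yes"}, "q2": {"no"}},
       "importance": {"q1": "mandatory", "q2": "a little"}}

print(round(match_percent(alice, bob), 1))  # → 91.3
```

The geometric mean is what makes the score symmetric and punishing: one side's mandatory-question mismatch drags the whole percentage down, which is why McKinlay's randomly chosen questions buried him.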

But **mathematically, McKinlay’s compatibility with women in Los Angeles was abysmal. **

OkCupid’s algorithms use only the questions that both potential matches decide to answer, and the match questions McKinlay had chosen—more or less at random—had proven unpopular.

When he scrolled through his matches, fewer than 100 women would appear above the 90 percent compatibility mark. And that was in a city containing some 2 million women (approximately 80,000 of them on OkCupid).

On a site where **compatibility equals visibility,** he was practically a ghost.

He realized he’d have to boost that number. If, through statistical sampling, McKinlay could ascertain which questions mattered to the kind of women he liked, he could construct a new profile that honestly answered those questions and ignored the rest.

He could match every woman in LA who might be right for him, and none that weren’t.

Chris McKinlay used **Python scripts** to riffle through hundreds of OkCupid survey questions. He then sorted female daters into 7 **clusters**, like “Diverse” and “Mindful,” each with distinct characteristics. (Photo: Maurico Alejo)

Even for a mathematician, McKinlay is unusual.

Raised in a Boston suburb, he graduated from Middlebury College in 2001 with a degree in Chinese. In August of that year he took a part-time job in New York **translating Chinese into English** for a company on the 91st floor of the north tower of the **World Trade Center**.

The towers fell 5 weeks later. (McKinlay wasn’t due at the office until 2 o’clock that day. He was asleep when the first plane hit the north tower at 8:46 am.) “After that I asked myself what I really wanted to be doing,” he says.

A friend at Columbia recruited him into an offshoot of MIT’s famed professional blackjack team, and he spent the next few years bouncing between New York and Las Vegas, counting cards and earning up to $60,000 a year.

The experience kindled his interest in applied math, ultimately inspiring him to earn a master’s and then a PhD in the field. “**They were capable of using mathematics in lots of different situations,” he says. “They could see some new game—like Three Card Pai Gow Poker—then go home, write some code, and come up with a strategy to beat it.**”

Now he’d do the same for love. First he’d need data.

While his dissertation work continued to run on the side, he set up 12 fake OkCupid accounts and wrote a Python script to manage them. The script would search his target demographic (heterosexual and bisexual women between the ages of 25 and 45), visit their pages, and scrape their profiles for every scrap of available information: ethnicity, height, smoker or nonsmoker, astrological sign—“all that crap,” he says.

To find the survey answers, he had to do a bit of extra sleuthing.

OkCupid lets users see the responses of others, but only to questions they’ve answered themselves.

McKinlay set up his bots to simply **answer each question randomly**—he wasn’t using the dummy profiles to attract any of the women, so the answers didn’t matter—then scooped the women’s answers into a database.

McKinlay watched with satisfaction as his bots purred along. Then, after about a thousand profiles were collected, he hit his first roadblock.

**OkCupid has a system in place to prevent exactly this kind of data harvesting: It can spot rapid-fire use easily. One by one, his bots started getting banned.**

**He would have to train them to act human.**

He turned to his friend **Sam Torrisi**, a neuroscientist who’d recently taught McKinlay **music theory** in exchange for advanced math lessons.

Torrisi was also on OkCupid, and he agreed to install **spyware** on his computer to monitor his use of the site. With the data in hand, McKinlay programmed his bots to **simulate Torrisi’s click-rates and typing speed**.

He brought in a second computer from home and plugged it into the math department’s broadband line so it could run uninterrupted 24 hours a day.

After 3 weeks he’d harvested **6 million questions and answers from 20,000 women** all over the country.

McKinlay’s dissertation was relegated to a side project as he dove into the data. He was already sleeping in his cubicle most nights. Now he gave up his apartment entirely and moved into the dingy beige cell, laying a thin mattress across his desk when it was time to sleep.

For McKinlay’s plan to work, he’d have to find a pattern in the survey data—a way to **roughly group the women according to their similarities**.

The breakthrough came when he coded up a modified **Bell Labs algorithm called K-Modes**.

First used in 1998 to analyze **diseased soybean crops**, **K-Modes** takes **categorical data and clumps it like the colored wax swimming in a Lava Lamp. With some fine-tuning he could adjust the viscosity of the results, thinning it into a slick or coagulating it into a single, solid glob.**

He played with the dial and found a natural resting point where the 20,000 women clumped into 7 **statistically distinct clusters** based on their questions and answers. “I was ecstatic,” he says. “That was the high point of June.”
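A minimal sketch of the k-modes idea — categorical "survey answers," a Hamming-style disagreement count, and per-column modes as cluster centers (the data and deterministic seeding are invented for illustration; this is not McKinlay's code):

```python
import numpy as np

def kmodes(X, k, iters=10):
    """Minimal k-modes: Hamming-style distance to per-column modes."""
    # Seed the modes with rows spread across the data, for determinism.
    modes = X[:: max(1, len(X) // k)][:k].copy()
    for _ in range(iters):
        # Each row joins the cluster whose mode it disagrees with least.
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each cluster's mode, column by column.
        for c in range(k):
            members = X[labels == c]
            if len(members) == 0:
                continue
            for j in range(X.shape[1]):
                vals, counts = np.unique(members[:, j], return_counts=True)
                modes[c, j] = vals[counts.argmax()]
    return labels, modes

# Toy categorical "survey answers" with two obvious answer styles.
X = np.array([["yes", "often", "city"]] * 30 + [["no", "never", "rural"]] * 30)
labels, modes = kmodes(X, 2)
print(labels[0], labels[-1])   # the two styles land in different clusters
```

Replacing means with modes and Euclidean distance with a disagreement count is the whole trick: it lets a k-means-style loop run on answers like “yes/no/maybe” where averaging makes no sense.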

He retasked his bots to gather another sample: 5,000 women in Los Angeles and San Francisco who’d logged on to OkCupid in the past month.

Another pass through K-Modes confirmed that they clustered in a similar way. His statistical sampling had worked.

Now he just had to decide which cluster best suited him. He checked out some profiles from each. One cluster was too young, two were too old, another was too Christian.

**But he lingered over a cluster dominated by women in their mid-twenties who looked like indie types, musicians and artists. This was the golden cluster. The haystack in which he’d find his needle. Somewhere within, he’d find true love.**

Actually, **a neighboring cluster looked pretty cool too—slightly older women who held professional creative jobs, like editors and designers. He decided to go for both. **

He’d set up two profiles and optimize one for the A group and one for the B group.

He text-mined the two clusters to learn what interested them; teaching turned out to be a popular topic, so he wrote a bio that emphasized his work as a math professor.

The important part, though, would be the survey.

He picked out the 500 questions that were most popular with both clusters. He’d already decided he would fill out his answers honestly—he didn’t want to build his future relationship on a foundation of computer-generated lies.

But he’d let his computer figure out how much importance to assign each question, using a machine-learning algorithm called adaptive boosting to derive the best weightings.
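A sketch of how adaptive boosting can derive such weightings, using one-feature "stumps" over toy binary answers (the data, feature count, and round count are invented for illustration; this is the textbook AdaBoost weight update, not McKinlay's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: 5 binary "survey answers" per sample; only answers 0 and 3
# actually predict the ±1 target. Boosting should weight those most.
N = 400
X = rng.integers(0, 2, size=(N, 5))
y = np.where(X[:, 0] + X[:, 3] >= 1, 1, -1)

w = np.ones(N) / N                 # sample weights
alpha = np.zeros(5)                # accumulated importance per feature

for _ in range(20):
    # Weak learner: the single feature whose value best predicts y under the
    # current sample weights (a depth-1 stump with a fixed direction).
    errs = np.array([(w * (np.where(X[:, j] == 1, 1, -1) != y)).sum()
                     for j in range(5)])
    j = int(errs.argmin())
    err = float(min(max(errs[j], 1e-10), 1 - 1e-10))
    a = 0.5 * np.log((1 - err) / err)       # this round's weight
    alpha[j] += a
    pred = np.where(X[:, j] == 1, 1, -1)
    w *= np.exp(-a * y * pred)              # upweight misclassified samples
    w /= w.sum()

print(alpha.round(2))   # the predictive answers carry (nearly) all the weight
```

Each round re-focuses the sample weights on the cases the current weighting gets wrong, so uninformative questions end up with importance near zero — the same effect McKinlay wanted when assigning importance to survey questions.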