# Statistical Thinking

Get in Touch with Statistics - Not with Cookies

Statistical Thinking provides insights into statistics, **machine learning algorithms**, artificial neural networks, automated social media screenings and code for facial recognition. You get applicable introductions from a **social scientist** (PDF) and fellow of the Royal Statistical Society, who received his doctorate from the Justus Liebig University Giessen for performing one of the first **longitudinal media analyses**. Reading Herbert George Wells, who said "statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write", made me familiarize myself with Python and R via the **data science specialization** (PDF) of Johns Hopkins University (USA), the **machine learning course** (PDF) and **cryptography course** (PDF) of Stanford University (USA), the **data science math skills course** (PDF) of Duke University (USA) and the **mathematics for machine learning specialization** (PDF) of Imperial College London (UK), all on Coursera. In addition, I completed the **deep learning specialization** (PDF) of DeepLearning.AI. Besides my current academic lectures, I advise public as well as governmental organisations on the application of multivariate statistics and the limitations of artificial intelligence by providing some catchy introductions to **Python and R**.

with examples from

**Prof. Dr. Dennis Klinkhammer**

University of Applied Sciences Teacher

This academic textbook summarizes my most popular introductions to the **R programming language**. It starts with the fundamentals of **research methods** and quantitative research (chapter one) and explains how some of the most common **statistical formulas** are related and how they work (chapter two). In addition, several areas of application of **empirical causal analysis** as well as tips for their interpretation are presented (chapter three). The textbook then leads to an **introduction to machine learning** algorithms and traces them back to their origins within classical statistics (chapter four).

Both a **digital and a print version** (UTB) are available, with several programming examples and video tutorials in the fifth chapter. The **table of contents** (PDF) can already be viewed here. Those who have the print version can easily access the **programming examples** (PDF) and **video tutorials** (PDF) online. Thus, the textbook offers a good opportunity to review all the learning content independently if one should have missed a lecture - which never happens, of course...

Machine learning algorithms enable the computer to generalize from experience by using **mathematical models** generated from training data. For example, a simple Python **code** (ZIP) based upon logistic regression can be used to differentiate between good and bad wines based upon their chemical composition. Another well-known machine learning algorithm is the **k-nearest neighbors algorithm**, a quite simple and non-parametric method for classification and regression. Training data is used to generate **vectors in a multidimensional feature space** with appropriate class labels in order to measure distances between data points. The k-nearest neighbors algorithm assumes that **class labels** (GIF) of nearer neighbors are more likely to be the same than class labels of more distant neighbors. With the corresponding Python **code** (ZIP) this process can be clarified via two-dimensional scatterplots that are merged into a three-dimensional **principal component analysis**. This specific machine learning algorithm as well as its accuracy are demonstrated using the famous IRIS dataset, for which there is also code written in R further down this page.
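The distance-based voting idea behind k-nearest neighbors can be sketched in a few lines of plain Python. This is only a minimal illustration and not the downloadable code: the nine training points are hand-typed, iris-like values (petal length and petal width in cm) chosen for the example, and the names `euclidean_distance` and `knn_predict` are made up here.

```python
import math
from collections import Counter

# Hypothetical, iris-like training data: (petal length, petal width) in cm.
# These values are typed in for illustration and are not the real IRIS dataset.
train_points = [
    (1.4, 0.2), (1.3, 0.2), (1.5, 0.3),
    (4.5, 1.5), (4.1, 1.3), (4.7, 1.4),
    (6.0, 2.5), (5.8, 2.2), (6.1, 2.3),
]
train_labels = [
    "setosa", "setosa", "setosa",
    "versicolor", "versicolor", "versicolor",
    "virginica", "virginica", "virginica",
]

def euclidean_distance(a, b):
    """Straight-line distance between two points in the feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(points, labels, query, k=3):
    """Return the majority class label among the k training points nearest to query."""
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(points, labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

for query in [(1.6, 0.4), (4.4, 1.4), (5.9, 2.1)]:
    print(query, "->", knn_predict(train_points, train_labels, query))
```

The choice of `k` and of the distance measure are both assumptions of this sketch; on real data they would typically be tuned and the features scaled beforehand.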

Combining **statistical thinking** with a programming language like Python also makes it possible to create artificial neural networks. They are supposed to **imitate neurons within the human brain** in order to recognise patterns automatically and learn something new without the need to be specifically programmed - just like a machine learning algorithm. Given a common situation, as shown below, artificial neural networks can **predict the correct output data** (on the right) when provided with some **corresponding input data** (on the left):

A human brain easily identifies that the first input column seems to determine the output column. Thus, a new row of input data (010) should correspond to (0) as output data and (110) should consistently correspond to (1). By using a **logistic regression model** with three predictors (one for each column of the input data) the output data can be predicted correctly, provided that the **automated learning process** is capable of adjusting the weights for each predictor. That's it - an artificial neural network that recognises the patterns of each similar situation and adapts automatically. Furthermore, this Python **code** (ZIP) can be reprogrammed for linear and other non-linear contexts as well.
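How such a learning process adjusts the weights can be sketched with a single artificial neuron trained by gradient descent. This is a minimal sketch and not the downloadable code: the four training rows are an assumed stand-in for the table (they merely follow the rule that the output equals the first column), and the intercept, learning rate and number of epochs are choices made for this illustration.

```python
import math

# Assumed training data: a hypothetical stand-in for the table on this page,
# built so that the output always equals the first input column.
training_inputs = [(0, 0, 1), (1, 1, 1), (1, 0, 1), (0, 1, 1)]
training_outputs = [0, 1, 1, 0]

def sigmoid(z):
    """Logistic function mapping any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, row):
    """Predicted probability that the output for this input row is 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

def train(inputs, outputs, learning_rate=0.1, epochs=10000):
    """Fit one weight per input column plus an intercept by gradient descent."""
    weights = [0.0, 0.0, 0.0]
    bias = 0.0  # intercept term, an assumption of this sketch
    for _ in range(epochs):
        grad_w = [0.0, 0.0, 0.0]
        grad_b = 0.0
        for row, target in zip(inputs, outputs):
            error = target - predict(weights, bias, row)
            for i, x in enumerate(row):
                grad_w[i] += error * x
            grad_b += error
        # Move every weight a small step in the direction that reduces the error.
        weights = [w + learning_rate * g for w, g in zip(weights, grad_w)]
        bias += learning_rate * grad_b
    return weights, bias

weights, bias = train(training_inputs, training_outputs)
for new_row in [(0, 1, 0), (1, 1, 0)]:
    print(new_row, "->", round(predict(weights, bias, new_row), 3))
```

After training, the weight of the first column dominates, so unseen rows such as (010) and (110) are assigned probabilities below and above 0.5 respectively.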

Within social media, left- and right-wing extremism can be considered widespread phenomena with a rising amount of **radical content**. For these phenomena numerous theoretical explanations are at hand, and Leiden University summarised some of them within their course on **Terrorism and Counterterrorism** (PDF), as did the University of Maryland regarding **Countering Violent Extremism** (PDF). A subsequent quantitative research focus on the **process of radicalisation** (PDF) aims to uncover underlying mechanisms like frames and pull-factors within YouTube. This social media platform can be accessed via **application programming interfaces** in order to identify suspicious actors. Since social media platforms provide large amounts of unstructured data, statistical methods common for **big data** have to be applied. However, the most relevant variables for **identifying suspicious actors** seem to be the number of comments, likes and replies on YouTube as well as the content of each comment, which can be itemised via **natural language processing**. Due to some security restrictions this specific code can't be made public, but more general tutorials on machine learning and artificial neural networks can be found on this page.
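As a rough idea of what itemising comment text can mean in its simplest form, the following sketch splits comments into lowercase word tokens and counts them. It is explicitly not the withheld screening code: the function names `tokenize` and `term_frequencies` and the sample comments are made up for illustration, and real natural language processing pipelines go well beyond word counts.

```python
import re
from collections import Counter

def tokenize(comment):
    """Split a comment into lowercase word tokens (Latin letters and German umlauts)."""
    return re.findall(r"[a-zäöüß]+", comment.lower())

def term_frequencies(comments):
    """Count how often each token occurs across a list of comments."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    return counts

# Hypothetical, neutral sample comments used only to show the mechanics.
sample_comments = [
    "Great video, GREAT points!",
    "I disagree with the second point.",
]

frequencies = term_frequencies(sample_comments)
print(frequencies.most_common(3))
```

Token counts like these could then be combined with numeric variables such as the number of likes and replies as inputs for a statistical model.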

My PhD thesis aimed to introduce the **longitudinal media analysis** in order to analyze the **portrayal of people** in popular media automatically and without having to view the media oneself. To this end, the distances of the faces on the y-axis and x-axis were taken into account and examined with regard to **camera setting** and **camera perspective**. Possible interpretations were validated by target-specific viewers, and an **algorithm for facial recognition** can now be used to analyze the image material second by second.

An extract from the underlying Python **code** (ZIP) can be downloaded right here. It uses **TensorFlow** and **OpenCV** in order to process the images. As a result, this code is particularly concise and can be easily adapted to individual needs. The simplicity of the code allows it to be **applied to large amounts of images**. Finally, it offers an exciting insight into how facial recognition works in smartphones and several computer programs. This code gives you the opportunity to try out **which faces can be recognized** and which probably cannot.

In addition to my academic lectures and textbook, I offer short YouTube Online Tutorials to **summarize the knowledge** that is required for understanding how quantitative research, machine learning and artificial intelligence work. They focus on the fundamental principles of working with variables and applying **research methods** by introducing R as a programming language. The statistical basics of **quantitative data analysis**, such as multivariate modelling and significance testing, are presented with practical examples, followed by a step-by-step introduction to the application of **machine learning algorithms** within R and a graphical representation of the underlying mechanisms.

The YouTube Online Tutorials are currently only available in German but **free for all participants**. Completing all three parts should enable you to **get started with R** and to program your first lines of code all by yourself. Please note that clicking on the provided links takes you to YouTube and the data protection regulations applicable there. Since 2022, a **Massive Open Online Course** (MOOC) has also been available.

In the **first part** (YouTube) you can familiarize yourself with the fundamentals of research methods.

Statistical methods will be presented within the **second part** (YouTube), all with practical examples.

Finally, the **third part** (YouTube) prepares you for programming your first machine learning algorithms.

Let's have some first experiences with R by using the **SWISS** (ZIP) dataset for sociological analysis.

Create a multivariate model with the **MTCARS** (ZIP) dataset and get some insights into applied physics.

The **IRIS** (ZIP) dataset is a perfect playground in order to predict something that is really beautiful.

Learn to calculate the goodness of fit with **SIMULATED** (ZIP) datasets and nonlinear regression models.

Install the package corrplot and visualise the bivariate structure inside the **TREES** (ZIP) dataset.

A **TOOTHGROWTH** (ZIP) dataset for comparing the effects of ascorbic acid and orange juice in guinea pigs.