ML: needs to be covered and tasks to be solved. Machine learning.Part 3

ML is a tool for solving a specific class of problems.

Let’s consider the following example to understand the main types of problems that machine learning algorithms solve, and why such issues are unsolvable (or hard to be solved efficiently) using explicit methods.

For example, we need a program that defines a photo of fruit as an apple or a tangerine. (Alternatively, designing a program that has to recognize a malignant tumor on an X-ray image, detect fraud, etc.)

1) The obvious way to solve this task is to get a person (an employee) with good eyesight to label the given photo. Clearly, this approach has its drawbacks:
a) The employee may get tired, sick, or skip work for some other reason
b) The employee needs to get paid, provided with insurance and a replacement during vacation
c) A person (in comparison with a computer) is rather slow in decision making
d) Humans make mistakes

2) Consequently, we decided to write a program that would solve our problem instead of a person. We could even gather the world’s leading experts on tangerines and apples to describe all the possible ways to differentiate these fruits from each other. As a result, we would get a program that, based on the color of the fruit, the length of the leaf and the oblongness of the fruit, tells us whether it is an apple or a tangerine. The system could work fine for a while until we come across an apple with a leaf similar in shape to a tangerine one, or a reddish apple (almost orange) like a tangerine, or a rather round-shaped apple. The human would immediately see that it is an apple, but our program would conclude: “This is a tangerine!” Then we would gather the experts again, investigate the problematic cases, add some rules to the program… The upgrading of the program in this fashion will happen from time to time until we achieve the desired result.
Cons of this approach:
a) High development cost (experts fees, programmers salaries, etc.)
b) The resulting program will be very complex and difficult to maintain.
c) Long development span
d) The inability to discover all necessary dependencies in the problem field at once, and to jointly describe all possible cases and differences

3) Understanding that options 1 and 2 are unsuitable for us, we look for alternative ways to solve this problem and turn to machine learning algorithms. Using suitable ML algorithms, we get a program that, after being trained on a large number of photos of apples and tangerines, can extract the necessary dependencies from the data and solve our problem.
Cons:
a) A large amount of labeled data is required (a large number of photos of apples and tangerines marked accordingly)

Now let’s imagine that we have not 2 but 100 types of fruits: in this case, option 2 (writing our software without using ML) becomes completely infeasible. Even more, much more complicated problems, rather than fruit recognition, are possible, which would in turn be more difficult to implement explicitly. For example, malignant tumor recognition on X-ray images, fraudulent activities with bank cards, spam detection, speech recognition, etc.. Some tasks are just irrational (almost impossible) to solve without the ML tools.

Having figured out why we need ML, let’s see what tasks it is suitable for.

The main tasks that machine learning algorithms solve are issues that are difficult or impossible or not rational to solve in direct, explicit software or in an analytical way. Among these tasks (according to the type of problems), the following 4 can be distinguished:

Regression is the problem of predicting a continuous numerical value for a specific object based on its characteristics. For example, the real estate market prices forecast, the temperature forecast, the prediction of the amount of money the client will spend in the store, etc.

Classification is the task of predicting a categorical attribute of an object. For example, categorization of incoming emails into spam and non-spam, the task of credit scoring, image classification, etc.

Anomaly detection is the task of identifying items, events or observations that do not match the expected pattern or other items in the dataset. For example fraud detection, system failure detection, discovering mistakes in a text, etc.

Clustering is the task of grouping similar objects into clusters. Unlike the classification problem, the number of clusters and which cluster (which group) the objects in the dataset belong to are not known in advance.

What is AI, ML and Data Science? Machine learning.Part2

We continue the series of articles on AI and ML, let us first consider the definition and align what we mean under each term.

What is artificial intelligence (AI), machine learning (ML) and Data Science?

We will consider the question: “What is AI, ML, Data Science, and how do they differ?”

The term Artificial Intelligence is widely used and often understood as a kind of a system (an abstraction) that bears the properties of human intelligence, a system that can think, solve problems (including creative ones); these require the presence of a thought process in the implementation core of the system. Perhaps the most important thing to know is that artificial intelligence as described above does not exist in the present yet. Our brain and consciousness as a whole are too complex and not yet sufficiently studied to digitize them or make a mathematical model that mimics the working of our thought process. However, there are attempts (including quite fruitful ones) to imitate the activity of our brain in solving certain specific problems. A notable direction in AI that falls under this description is ML.

Before moving on to ML, let’s give a more rigorous definition of AI.

Artificial intelligence (AI) is an engineering and mathematical discipline that creates programs, devices and mathematical abstractions that simulate cognitive (intellectual) functions of a person, including, inter alia, data analysis and decision making.

Strong AI / Super-AI – an intelligent algorithm capable of solving a wide range of intellectual tasks, at least on a par with the human mind.
Narrow AI, Weak AI is an intelligent algorithm that imitates the human mind in solving specific highly specialized tasks (playing chess, recognizing faces, communicating in natural language, searching for information, etc.).

Machine learning definition

Machine learning is a class of artificial intelligence methods, designed to draw from data by learning through experience in solving many similar problems, instead of solving the problem directly. These methods are often driven by mathematical statistics, numerical optimization methods, mathematical analysis, probability theory, graph theory, as well as various techniques aimed at working with digital data.

Let’s say we have an algorithm that allows us to trade on an exchange. The algorithm does not know about the existence of the exchange, traders, brokers, etc. It is just a maths model that has been trained to trade on hundreds of thousands of examples.

Likewise, the algorithm that drives a self-driving car has no idea of what a car is, a road, an engine, how all these work together, and so on. The algorithm is trained on a large number of examples of how to solve a particular problem but does not possess the ability to go beyond the framework of a previously formulated problem.

Machine learning algorithms are software implementations of a particular maths model. This model “learns” to solve a particular problem on a large amount of data, to find the patterns in the data to solve a particular task. The main part of this article focuses on ML, the principles behind it and its implementation in projects.

Data science is a broad notion that stands for a field, a profession that focuses on working with data. Data Scientist can stand for a person, working with databases, or someone who designs machine learning algorithms, as well as someone who maintains data managing infrastructure.

“Data science” is as broad of a notion as, for example, “Сomputer science” is.

Now, after reviewing the key notions, a question cries out for an answer: “Why do we need ML?”.

We will consider this in the next part (part 3) of our article.

AM-BITS supported by Cloudera – Technology partner of the UAFIN.TECH 2021 conference

We would like to invite our friends and colleagues to the joint presentation of AM-BITS and Cloudera representatives during the conference UAFIN.TECH 2021, which will take place on March, 8 in Parkoviy Exhibition Center. UAFIN.TECH 2021 is a unique concentration of leading experts, investors, bankers and top managers of the largest companies. The joint discussion will take place in the stream “Future technologies”.

For the first time in Ukraine, live, representatives of Cloudera will tell about new trends and together with the CEO of AM-BITS, Mr. Manzhulyanov, will share their experience and successful cases in the financial sector.

We would like to remind you that AM-BITS is the only Cloudera partner in Ukraine with the status of Cloudera Silver Partner, which confirms the presence of an expert team and relevant experience.

See you there!

———–

Cloudera is an American company, the developer of the most complete and comprehensive set of software products for working with big data. The comprehensive platform provides tools for working with data for each phase of the life cycle of the data, and provides all the requirements for working with sensitive data, including data security, data management, machine learning, analytics, etc. All of the tools are optimized for a frictionless and hybrid infrastructure. 9 of the 10 largest banks in the world work with Cloudera.

Machine Learning. Part 1

Introduction

The notion of machine learning is truly omnipresent nowadays. ML has firmly taken its place in the news trends, the job market and business automation. The success stories of “Artificial Intelligence” implementation are truly ample, and Data Scientist has become “The Sexiest Job of the 21st Century”.

That being said, despite its immense popularity, ML is still rather difficult to comprehend, due to its complexity, novelty, and booming growth (which gives rise to a bunch of myths), and there are plenty of open questions for people trying to understand this topic.

The main goal of this series of articles is to describe, in simple language, what Machine learning is, where and how it is applied, what tasks it solves, and to dispel several relevant myths, while the main intended takeaway is the basic concepts that are necessary for implementing a successful ML project.

In this part of the article, we will consider the question of what machine learning is, give basic definitions, list the stages that implementing an Machine Learning project consists of, and see what tasks can be solved using ML.

We will start with basic concepts and gradually delve deeper into the essentials, but first – let’s consider the following example.

Let’s consider an agency that resells used cars, and there is a need to predict its value on the market (resell price), based on given information about the car. The company needs to analyze a large number of ads from various sites in order to first respond to profitable offers (in literally less than a second after the proposal appears); however, since the number of ads per day on various resources is huge, it is almost impossible to track them manually.

In order to solve this issue, the company needs to develop a software assistant that would quickly grep through the ads and find suitable ones. It would predict the car prices on the secondary market, and if the expected market price for a particular car is higher than the proposal, the ad should be sent to an expert for consideration. (Learn more about case ).

To solve the problem is needed:

Articulate the problem (to build an algorithm for predicting the price of a car on the secondary market based on the car properties).
Collect the vehicle data, stored on ad sites. Based on this data, we will train the algorithm and make predictions.
Do data preprocessing (bring data into tabular form, clean, enrich data, process missing data).
Build a predictive model.
Develop a software infrastructure for this task and integrate our algorithm from point 4 into it.

After implementing the above steps, we get a program that collects cars sale ads from ad sites, analyzes them and only sends the likely financially profitable ones for expert consideration.

As you can see, ML is not a magic wand that solves any problem out-of-the-box, but a complex tool that requires thoughtful integration, not to mention the process of researching and developing the Machine Learning algorithms themselves.

This example shows how ML can be used to automate business processes, and more importantly, it demonstrates the basic steps (1-5) of developing an ML project. It should be kept in mind that despite the seeming simplicity, the implementation of an ML project is a complex and rather difficult task.

So, let’s move to the further parts of this article to get a deeper understanding of what Machine Learning is.

We offer you to read the implemented ML projects by AM-BITS: https://am-bits.com/solutions/analytics-projects

How to migrate from Hortonworks Data Platform to Cloudera Data Platform?

Why migrate from HDP/CDH?

In 2019 Cloudera presented a new platform – Cloudera Data Platform – as a universal solution that allows for data management in any environment: Public Cloud, bare metal, Private Cloud, and Hybrid Cloud.

According to the new development strategy, presented by Jan Kunigk, Cloudera operation CTO in EMEA, and Florian von Walter, chief manager of Cloudera engineering solutions in Germany, Austria, Eastern Europe and Russia – newspaper “Storage News” № 1 (76), 2020 – on-premise Hadoop based solutions development is the first stage, followed by migration to Public Cloud, and, finally, to Hybrid Cloud.

Following the new Cloudera strategy, it is recommended to migrate from platforms CDH – Cloudera Distribution of Hadoop – and HDP – Hortonworks Data Platform – to CDP, since CDH and HDP support ends on December 31, 2021, which means that respective products will not receive updates, and there will be no maintenance options available for them. Consequently, enterprise clients need to migrate from HDP/CDH to the actual stack, to maintain their solutions’ operability.

Why CDP?

HDP and CDH users are recommended to migrate to the actual Cloudera stack since Cloudera features the most complete set of tools for working with corporate data:

Cloudera Data Platform – a platform for collecting and storing data, for building EDW, EDH.
Edge & Flow Management – for end devices managing, controlling and monitoring.
Streams Messaging – for real-time delivery of large volumes of data.
Data Science Workbench – for data analysis and solving AI/ML problems.
Cloudera Manager – cluster management subsystem.
Cloudera also provides a full set of tools for solving Data Security, Data Management and Data Governance problems.
Full-fledged technical maintenance is available for Cloudera solutions.

Compare functionality and components of the listed platforms.

In contrast to previous HDP, CDP is not available as an open free distribution, however, additional functionality and tools make Cloudera stack the most convenient and economically effective tool for building Hadoop-based enterprise solutions.

Getting ready for migration

Cloudera provides detailed instructions for organizing the migration process, featuring several scenarios:

Trial CDP versions are available for different environments:

48h Cloudera platform test-drive in the cloud
a free trial version of CDP Private Cloud for familiarization and testing

CDP Upgrade Advisor, which contains detailed recommendations per specific clusters, is also available

Integration process

Choose your CDP migration variant: either full upgrade or migration with fail-proof operation requirements.
Check upgrade requirements and satisfy all prior conditions.
Choose target environment:
- CDP Public Cloud is recommended by Cloudera for systems with up to 50 nodes:
  - HDP to CDP Public Cloud
  - CDH to CDP Public Cloud
- CDP Private Cloud is recommended by Cloudera for systems with more than 50 nodes:
  - HDP to CDP Private Cloud
  - CDH to CDP Private Cloud
- CDP on-premise is recommended for customers, who due to legislation or corporate policy, do not consider migration to the cloud.
Setup, migrate, test and confirm.

Migration plan example:

Migration of the dev environment from Hortonworks stack (HDP / HDF) to Cloudera (СDP / CDF) stack

2 weeks

1.1

Cleaning the test environment and preparing infrastructure and security requirements.

1.2

Setting up and configuring the CDP dev environment.

1.3

Transferring developments and data from the HDP/HDF dev environment to the CDP dev environment.

1.4

Testing and tuning the CDP dev environment.

Розширення кластера TEST і міграція зі стека Hortonworks (HDP / HDF) на стек Cloudera (СDP / CDF)

2 weeks

2.1

Cleaning the HDP/HDF dev environment.

2.2

Setting up and configuring the CDP test environment.

2.3

Transferring developments and data from the CDP dev environment to the CDP test environment.

2.4

Testing and tuning the CDP test environment.

Building the prod cluster on Cloudera (СDP / CDF) stack

3 weeks

3.1

Cleaning the HDP/HDF prod environment.

3.2

Setting up and configuring the CDP prod environment.

3.3

Transferring developments and data from the HDP/HDF prod environment to the CDP prod environment.

3.4

Testing and tuning the CDP prod environment.

LLC “AM-BITS” is a direct Cloudera partner (Silver Partner) and has a dedicated Big Data team of 15 highly qualified architects and engineers, including 7 Hortonworks and Cloudera certified specialists. AM-BITS has 5 years of experience in building Big Data Hadoop-based solutions for enterprise clients (including projects for international banks, telecom operators and media companies).

We are ready to develop a strategy for building an enterprise data platform per the international best practices and to conduct a migration to or implementation of Cloudera Data Platform, with fail-proof services operation, and also, when the migration/implementation project is complete, to provide technical solution maintenance both remote and on-site.