Credit Card Fraud Detection


A MINOR PROJECT REPORT

On

“CREDIT CARD FRAUD DETECTION” Submitted To:

CHHATTISGARH SWAMI VIVEKANAND TECHNICAL UNIVERSITY, BHILAI For Fulfilment of the Award of the Degree

Bachelor of Engineering, VIIth Semester
In
COMPUTER SCIENCE & ENGINEERING

Submitted By:
Dolly Gohil (BE3073)
G. Shivani Murthy (BE3097)
Osheen Arya (BE3780)

Guided By:
Dr. Sudhir Kumar Meesala
Asst. Prof., Department of CSE

CHOUKSEY ENGINEERING COLLEGE, BILASPUR (C.G.)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Session:- 2020-21

DECLARATION BY THE CANDIDATES
We, the undersigned, solemnly declare that the report of the project work entitled “CREDIT CARD FRAUD DETECTION” is based on our own work carried out during the course of our Minor Project Lab under the guidance of Mr. Bharat Choudhary, Asst. Professor, Department of Computer Science & Engineering, Chouksey Engineering College, Bilaspur (C.G.).

We further declare that the statements made and conclusions drawn are an outcome of our project work.

_______________ (Signature of the Candidate) Dolly Gohil Enrollment No.: BE3073

________________ (Signature of the Candidate) G. Shivani Murthy Enrollment No.: BE3097

_________________ (Signature of the Candidate) Osheen Arya Enrollment No.: BE3780


CERTIFICATE BY THE GUIDE
This is to certify that the project entitled “CREDIT CARD FRAUD DETECTION” is a record of work carried out by Dolly Gohil (Enrollment No.: BE3073), G. Shivani Murthy (Enrollment No.: BE3097) and Osheen Arya (Enrollment No.: BE3780) under my guidance and supervision for the award of the Degree of Bachelor of Engineering of Chhattisgarh Swami Vivekanand Technical University, Bhilai (C.G.), India. To the best of my knowledge and belief, the project:
i. Embodies the work of the candidates themselves;
ii. Has not been submitted for the award of any other degree;
iii. Fulfils the requirements of the Ordinance relating to the B.E. degree of the University; and
iv. Is up to the desired standard in respect of contents and is being referred to the examiners.

(Signature of the Guide)
Prof. Bharat Choudhary
Asst. Prof., Dept. of CSE
CEC, Bilaspur (C.G.)

Recommendation
The project work as mentioned above is hereby recommended and forwarded for examination and evaluation.

(Signature of the Project In-Charge) Dr. Sudhir Kumar Meesala Project In-charge Asst. Prof. Dept of CSE CEC, Bilaspur (C.G.)

(Signature of the HOD with seal)
Mr. Akhilesh Sharma
Asst. Prof. & HOD, Department of CSE
CEC, Bilaspur (C.G.)


CERTIFICATE BY THE EXAMINERS
This is to certify that the project work entitled “CREDIT CARD FRAUD DETECTION”, submitted by Dolly Gohil (Enrollment No.: BE3073), G. Shivani Murthy (Enrollment No.: BE3097) and Osheen Arya (Enrollment No.: BE3780), completed under the guidance of Prof. Bharat Choudhary, Asst. Prof., Department of Computer Science & Engineering, Chouksey Engineering College, Bilaspur (C.G.), has been examined by the undersigned as a part of the examination for the award of the Bachelor of Engineering degree in the Computer Science & Engineering branch at Chouksey Engineering College, Bilaspur, of Chhattisgarh Swami Vivekanand Technical University, Bhilai (C.G.).

“Project Examined and Approved”

Internal Examiner Date:

External Examiner Date:

(Signature of the Principal with seal) Principal Chouksey Engineering College, Bilaspur (C.G)


ACKNOWLEDGEMENT
At the very outset, I express my gratitude to the almighty lord for showering his grace and blessings upon me to complete this project. Although our names appear on the cover of this project, many people contributed in some form or another to its development. We could not have done this project without the assistance and support of each of the following, and we thank you all. I wish to place on record my deep sense of gratitude to my project guide, Prof. Bharat Choudhary, Asst. Prof., Dept. of CSE, CEC Bilaspur, and my project in-charge, Dr. Sudhir Kumar Meesala, Asst. Prof., Dept. of CSE, CEC Bilaspur, for their constant motivation and valuable help throughout the project work. I express my gratitude to Mr. Akhilesh Sharma, H.O.D. of CSE, CEC Bilaspur, for his valuable suggestions and advice throughout the course. I also extend my thanks to the other faculty members for their cooperation during my course.

Finally, I would like to thank my friends for their cooperation to complete this project.

_______________ (Signature of the Candidate) Dolly Gohil Enrollment No.: BE3073

________________ (Signature of the Candidate) G. Shivani Murthy Enrollment No.: BE3097

_________________ (Signature of the Candidate) Osheen Arya Enrollment No.: BE3780


ABSTRACT
Financial fraud is a growing problem with far-reaching consequences for the financial industry, even though many techniques to combat it have been discovered. The use of credit cards is of paramount importance in improving the economic strength of any nation; however, the fraudulent activities associated with them are of great concern. When fraud occurs on credit cards, the negative impact is huge, as the financial loss experienced cuts across all the parties involved. As the payment process is simplified by the combination of the financial industry and IT technology, consumers' payment methods are changing from cash to electronic payment using credit cards, mobile micropayments, and app cards. As a result, the number of cases in which anomalous transactions are attempted by abusing e-banking has increased, and financial companies have started establishing Fraud Detection Systems (FDS) to protect consumers from abnormal transactions. An abnormal transaction detection system aims to identify abnormal transactions with high accuracy by analysing user information and payment information in real time. Although FDS has shown good results in reducing fraud, the majority of cases flagged by such systems are false positives, resulting in substantial investigation costs and cardholder inconvenience. The possibilities of enhancing the current operation constitute the objective of this research. Based on variations and combinations of testing and training class distributions, experiments were performed to explore the influence of these parameters. In this study, we investigated the trend of abnormal transaction detection using payment log analysis and data mining, and summarized the data mining algorithms used for abnormal credit card transaction detection. We used Python programming with Apache Spark for advanced processing of data and high accuracy.

Credit card fraud can be defined as a case where a person uses someone else's credit card for personal reasons while the owner and the card-issuing authorities are unaware of the fact that the card is being used. Due to the rise and acceleration of e-commerce, there has been tremendous use of credit cards for online shopping, which has led to a high number of frauds related to credit cards. In the era of digitalization, the ability to identify credit card fraud is necessary. Fraud detection involves monitoring and analysing the behaviour of various users to estimate, detect or avoid undesirable behaviour.

Keywords: Credit Card, Fraud Detection, Outlier Detection, GBT Classifier, E-Commerce, Python.


TABLE OF CONTENTS

Declaration by the Candidates
Certificate by the Guide
Certificate by the Examiners
Acknowledgement
Abstract
Table of Contents
List of Figures
List of Tables
List of Symbols
List of Abbreviations

Chapter- 01: Introduction
  1.1 Credit Card Fraud Detection
  1.2 Machine Learning
  1.3 Data Science
  1.4 Problem Definition
  1.5 Proposed Work/Algorithm
  1.6 Objectives of Credit Card Fraud Detection
  1.7 Scope of Credit Card Fraud Detection
  1.8 Organization of Credit Card Fraud Detection

Chapter- 02: Literature Survey/System Design
  2.1 Literature Survey
  2.2 System Design

Chapter- 03: Proposed Work
  3.1 Proposed System
  3.2 Functional Requirements
  3.3 Non-Functional Requirements
  3.4 Related Work
  3.5 Proposed Technique

Chapter- 04: Methodology
  4.1 Method
  4.2 Algorithms
  4.3 Use Case Diagram
  4.4 Data Flow Diagram (DFD)

Chapter- 05: Implementation
  5.1 Implementation
  5.2 System Implementation
  5.3 System Requirements
  5.4 Testing
  5.5 Results

Chapter- 06: Experimental Results
  6.1 Steps to Execute the Project
  6.2 Snap Shots of the Project

Chapter- 07: Conclusion and Future Enhancements
  7.1 Conclusion
  7.2 Future Enhancements

References

Appendixes
  Appendix- ‘A’ (Progress Monitoring Report)
  Appendix- ‘B’ (Source Code)
  Appendix- ‘C’ (About Authors)

LIST OF FIGURES

1. Data Science
2. Life Cycle of Data Science
3. Credit Card Fraud Statistics
4. No. of Worldwide Non-Cash Transactions
5. System Architecture of CCFDS
6. Basic Outline Architecture
7. Full Architectural Diagram
8. No. of Fraudulent and Non-Fraudulent Cases
9. Graphical Representation of Transaction (Time Feature)
10. Graphical Representation of Transaction (Monetary Value Feature)
11. Processes of CCFD
12. Logistic Curve
13. SVM Model Graph
14. Decision Tree
15. Use Case Diagram
16. Data Flow Diagram (DFD)
17. Evaluation of LR Algorithm
18. Evaluation of LDA Algorithm
19. Evaluation of KNN Algorithm
20. Evaluation of DTC Algorithm
21. Evaluation of SVM Algorithm
22. Evaluation of RF Algorithm
23. Testing of LR Algorithm
24. Testing of LDA Algorithm
25. Testing of KNN Algorithm
26. Testing of DTC Algorithm
27. Testing of SVM Algorithm
28. Testing of RF Algorithm
29. Fig. B-1
30. Fig. B-2
31. Fig. B-3
32. Fig. B-4
33. Fig. B-5
34. Fig. B-6
35. Fig. B-7
36. Fig. B-8
37. Fig. B-9
38. Fig. B-10
39. Fig. B-11
40. Fig. B-12
41. Fig. B-13

LIST OF TABLES

1. The Major Types of Credit Card Fraud
2. Testing
3. Progress Monitoring Report

LIST OF SYMBOLS

Data Flow Symbols:
• Defines the source and destination of data
• Identifies data flow
• Database

LIST OF ABBREVIATIONS
1. CCFD: Credit Card Fraud Detection.
2. IDE: Integrated Development Environment. A software application that provides comprehensive facilities to computer programmers for software development; an IDE normally consists of at least a source code editor, build automation tools and a debugger.
3. Services: Portions of code that run in the background to provide content and services to the applications.
4. ML: Machine Learning.
5. DS: Data Science.
6. LR: Logistic Regression.
7. LDA: Linear Discriminant Analysis.
8. KNN: K-Nearest Neighbours Classifier.
9. DTC: Decision Tree Classifier.
10. SVM: Support Vector Machine.
11. RF: Random Forest Classifier.
12. CCFDS: Credit Card Fraud Detection System.

CHAPTER- 01
INTRODUCTION

1.1 Credit Card Fraud Detection

1.1.1 Introduction
‘Fraud’ in credit card transactions is unauthorized and unwanted usage of an account by someone other than the owner of that account. Necessary prevention measures can be taken to stop this abuse, and the behaviour of such fraudulent practices can be studied to minimize it and protect against similar occurrences in the future. In other words, credit card fraud can be defined as a case where a person uses someone else’s credit card for personal reasons while the owner and the card-issuing authorities are unaware of the fact that the card is being used. Fraud detection involves monitoring the activities of populations of users in order to estimate, perceive or avoid objectionable behaviour, which consists of fraud, intrusion, and defaulting.

This is a very relevant problem that demands the attention of communities such as machine learning and data science, where the solution to this problem can be automated. The problem is particularly challenging from the perspective of learning, as it is characterized by various factors such as class imbalance: the number of valid transactions far outnumbers the fraudulent ones. Also, the transaction patterns often change their statistical properties over the course of time.

These are not the only challenges in the implementation of a real-world fraud detection system, however. In real-world examples, the massive stream of payment requests is quickly scanned by automatic tools that determine which transactions to authorize. Machine learning algorithms are employed to analyse all the authorized transactions and report the suspicious ones. These reports are investigated by professionals who contact the cardholders to confirm whether the transaction was genuine or fraudulent. The investigators provide feedback to the automated system, which is used to train and update the algorithm to eventually improve the fraud-detection performance over time.

1.1.2 Types of Fraud
Fraud detection methods are continuously developed to defend against criminals, who keep adapting their fraudulent strategies. These frauds are classified as:

• Card Theft
• Account Bankruptcy
• Device Intrusion
• Application Fraud
• Counterfeit Card
• Telecommunication Fraud
• Credit Card Frauds: Online and Offline

1.1.3 Approaches to Detect Fraud
Some of the currently used approaches to detect such fraud are:

• Artificial Neural Network
• Fuzzy Logic
• Genetic Algorithm
• Logistic Regression
• Decision Tree
• Support Vector Machines
• Bayesian Networks
• Hidden Markov Model
• K-Nearest Neighbour

1.2 Machine Learning

1.2.1 Introduction
Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.


Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics. Simple Definition: Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

1.2.2 History
The term machine learning was coined in 1959 by Arthur Samuel, an American IBMer and pioneer in the fields of computer gaming and artificial intelligence. A representative book of machine learning research during the 1960s was Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification. Interest related to pattern recognition continued into the 1970s, as described by Duda and Hart in 1973. In 1981 a report was given on using teaching strategies so that a neural network learns to recognize 40 characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal. Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." This definition of the tasks with which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. This follows Alan Turing's proposal in his paper "Computing Machinery and Intelligence", in which the question "Can machines think?" is replaced with the question "Can machines do what we (as thinking entities) can do?".

1.2.3 Machine Learning Approaches
Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal" or "feedback" available to the learning system:

• Supervised Learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
• Unsupervised Learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
• Reinforcement Learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that's analogous to rewards, which it tries to maximize.
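The supervised setting above can be sketched in a few lines. This is only an illustrative toy example, assuming the scikit-learn API; the feature values and labels are made up and are not the project's actual data:

```python
# Toy supervised-learning sketch (assumed scikit-learn API): the "teacher"
# supplies labelled examples, and the model learns a general rule mapping
# inputs to outputs that can be applied to unseen inputs.
from sklearn.linear_model import LogisticRegression

# Made-up labelled data: [amount, hour-of-day] -> 0 = genuine, 1 = fraud
X = [[20, 14], [35, 10], [15, 16], [900, 3], [850, 2], [999, 4]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# Apply the learned rule to transactions the model has never seen
print(model.predict([[25, 12], [950, 3]]))
```

In the unsupervised setting the `y` labels would be absent, and in reinforcement learning the feedback would arrive as rewards rather than correct answers.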

1.3 Data Science

1.3.1 Introduction
Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning and big data. Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, domain knowledge and information science. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. Data science is all about using data to solve problems. The problem could be decision making, such as identifying which email is spam and which is not. So, the core job of a data scientist is to understand the data, extract useful information out of it and apply this in solving problems. Data science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.

Fig:- Data Science

A data analyst usually explains what is going on by processing the history of the data. A data scientist, on the other hand, not only does exploratory analysis to discover insights from the data, but also uses various advanced machine learning algorithms to identify the occurrence of a particular event in the future. A data scientist will look at the data from many angles, sometimes angles not known earlier. So, data science is primarily used to make decisions and predictions making use of:

• Predictive causal analytics
• Prescriptive analytics (predictive plus decision science)
• Machine learning for making predictions
• Machine learning for pattern discovery

1.3.2 Life Cycle of Data Science

Phase 1—Discovery: Before you begin the project, it is important to understand the various specifications, requirements, priorities and required budget. You must possess the ability to ask the right questions. Here, you assess if you have the required resources present in terms of people, technology, time and data to support the project. In this phase, you also need to frame the business problem and formulate initial hypotheses (H) to test.

Fig:- Life Cycle of Data Science


Phase 2—Data Preparation: In this phase, you require an analytical sandbox in which you can perform analytics for the entire duration of the project. You need to explore, pre-process and condition data prior to modeling. Further, you will perform ETLT (extract, transform, load and transform) to get data into the sandbox. Let’s have a look at the statistical analysis flow below:

You can use R for data cleaning, transformation, and visualization. This will help you to spot the outliers and establish relationships between the variables. Once you have cleaned and prepared the data, it’s time to do exploratory analytics on it.

Phase 3—Model Planning: Here, you will determine the methods and techniques to draw the relationships between variables. These relationships will set the base for the algorithms which you will implement in the next phase. You will apply Exploratory Data Analytics (EDA) using various statistical formulas and visualization tools.

Phase 4—Model Building: In this phase, you will develop datasets for training and testing purposes. Here you need to consider whether your existing tools will suffice for running the models or whether a more robust environment (like fast and parallel processing) is needed. You will analyze various learning techniques like classification, association and clustering to build the model.

Phase 5—Operationalize: In this phase, you deliver final reports, briefings, code and technical documents. In addition, sometimes a pilot project is also implemented in a real-time production environment. This will provide you a clear picture of the performance and other related constraints on a small scale before full deployment.

Phase 6—Communicate Results: Now it is important to evaluate whether you have been able to achieve the goal that you planned in the first phase. So, in the last phase, you identify all the key findings, communicate them to the stakeholders and determine whether the results of the project are a success or a failure based on the criteria developed in Phase 1.
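The preparation and exploration phases above can be illustrated with a minimal sketch. The pandas usage and the column names below are illustrative assumptions, not the project's actual schema:

```python
# Tiny Phase 2-3 sketch: load a made-up transaction table, summarise a
# feature to spot outliers, and relate it to the class label.
import pandas as pd

df = pd.DataFrame({
    "amount":   [12.5, 40.0, 8.9, 2500.0, 15.2],
    "is_fraud": [0,    0,    0,   1,      0],
})

# Summary statistics expose the outlier (the 2500.0 transaction)
print(df["amount"].describe())

# Establish a relationship between variables: average amount per class
print(df.groupby("is_fraud")["amount"].mean())
```

The same conditioning steps (summaries, outlier checks, per-class comparisons) apply whether the tooling is R, pandas, or Spark.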

1.3.3 Advantages of Data Science
There are several advantages of data science:

• It’s in demand
• Abundance of positions
• A highly paid career
• Data science is versatile
• Data science makes products smarter
• Data science makes data better

1.3.4 Disadvantages of Data Science
The disadvantages of data science are:

• Data science is a blurry term
• Mastering data science is near to impossible
• A large amount of domain knowledge is required
• Arbitrary data may yield unexpected results
• The problem of data privacy

1.4 Problem Definition

Credit card frauds are increasing heavily, and because of fraud the financial losses are increasing drastically: billions are lost every year, yet there is a lack of research into analysing the fraud. Many machine learning algorithms, including ANNs and hybrid algorithms, have been applied to detect real-world credit card fraud. The credit card fraud detection problem involves modeling past credit card transactions with the knowledge of the ones that turned out to be fraud. This model is then used to identify whether a new transaction is fraudulent or not. Our aim here is to detect 100% of the fraudulent transactions while minimizing incorrect fraud classifications. Credit card fraud stands as a major problem for worldwide financial institutions; annual losses due to it scale to billions of dollars, as can be observed from many financial reports. For example, the 10th annual online fraud report by CyberSource shows an estimated loss due to online fraud of $4 billion for 2008, an 11% increase over the $3.6 billion loss in 2007 (Bhattacharyya et al., 2011); fraud in the United Kingdom alone was estimated at £535 million in 2007 and now costs around £13.9 billion a year (Mahdi et al., 2010). From 2006 to 2008, the UK alone lost between £427.0 million and £609.90 million to credit and debit card fraud (Woolsey & Schilz, 2011). Although there has been some decrease in such losses after the implementation of detection and prevention systems by governments and banks, card-not-present fraud losses are increasing at a higher rate due to online transactions, and much of this fraud still goes unprotected against and undetected.


Over the years, governments and banks have implemented steps to subdue these frauds, but along with the evolution of fraud detection and control methods, perpetrators are also evolving their methods and practices to avoid detection. Thus, effective and innovative methods need to be developed which will evolve according to the need.

1.5 Proposed Work/Algorithm

A mechanism is developed to determine whether a given transaction is fraudulent or not. Enormous amounts of data are processed every day, and the model built must be fast enough to respond to the scam in time. The main challenges are: imbalanced data, i.e. most of the transactions (99.8%) are not fraudulent, which makes it really hard to detect the fraudulent ones; data availability, as the data is mostly private; misclassified data, as not every fraudulent transaction is caught and reported; and, last but not least, the adaptive techniques used against the model by the scammers.

1.5.1 Algorithm Proposed
Data science encompasses a large collection of algorithms and techniques used in classification, regression, clustering or anomaly detection. The algorithms used in this credit card fraud detection are as follows:

➢ K-Nearest Neighbour (KNN)
➢ Support Vector Machine (SVM)
➢ Random Forest (RF)
➢ Logistic Regression (LR)
➢ Naïve Bayes
➢ Decision Trees

as well as genetic algorithms and others. Algorithms are often recommended as predictive methods as a means of detecting fraud. One algorithm that has been suggested by Bentley et al. (2000) is based on genetic programming, in order to establish logic rules capable of classifying credit card transactions into suspicious and non-suspicious classes. Basically, this method follows a scoring process. In the experiment described in their study, the database was made of 4,000 transactions with 62 fields. As for the similarity tree, training and testing samples were employed. Different types of rules were tested with the different fields; the best rule is the one with the highest predictability. Their method has shown proven results on real home insurance data.
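A hedged sketch of how the classifiers listed above might be compared is given below. The scikit-learn API and the synthetic imbalanced dataset are assumptions for illustration; this is not the project's actual code or data:

```python
# Compare the listed classifiers on a synthetic, imbalanced two-class
# problem (~1% positives) standing in for the real transaction data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "RF":  RandomForestClassifier(random_state=0),
    "LR":  LogisticRegression(max_iter=1000),
    "NB":  GaussianNB(),
    "DT":  DecisionTreeClassifier(random_state=0),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # On skewed data, recall on the fraud class is more telling than
    # raw accuracy, so that is what we report per model
    print(name, round(recall_score(y_te, clf.predict(X_te)), 2))
```

Stratified splitting keeps the rare fraud class represented in both the training and test partitions, which matters when positives are only around 1% of the data.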

1.5.2 Some Definitions
The following are essential definitions, in the current problem's context, needed to understand the approaches mentioned later:

• True Positive: the fraud cases that the model predicted as ‘fraud’.
• False Positive: the non-fraud cases that the model predicted as ‘fraud’.
• True Negative: the non-fraud cases that the model predicted as ‘non-fraud’.
• False Negative: the fraud cases that the model predicted as ‘non-fraud’.
• Threshold Cutoff Probability: the probability at which the true positive ratio and true negative ratio are both highest. It can be noted that this probability is minimal, which is reasonable as the probability of frauds is low.
• Accuracy: the measure of correct predictions made by the model, that is, the ratio of fraud transactions classified as fraud and non-fraud classified as non-fraud to the total transactions in the test data.
• Sensitivity: sensitivity, or True Positive Rate, or Recall, is the ratio of correctly identified fraud cases to total fraud cases.
• Specificity: specificity, or True Negative Rate, is the ratio of correctly identified non-fraud cases to total non-fraud cases.
• Precision: precision is the ratio of correctly predicted fraud cases to total predicted fraud cases.
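A small worked example of these definitions, using a hypothetical confusion matrix (the counts below are invented for illustration, not taken from the project's dataset):

```python
# Hypothetical confusion-matrix counts: 100 true frauds, 9,940 non-frauds
tp, fp, tn, fn = 80, 40, 9900, 20

accuracy    = (tp + tn) / (tp + fp + tn + fn)   # correct / all predictions
sensitivity = tp / (tp + fn)                    # recall, true positive rate
specificity = tn / (tn + fp)                    # true negative rate
precision   = tp / (tp + fp)                    # predicted-fraud hit rate

print(round(accuracy, 4), sensitivity, round(specificity, 4), round(precision, 4))
```

Note that accuracy comes out near-perfect even though a fifth of the frauds were missed and a third of the fraud alerts were false, which is exactly why the skew-aware metrics above are needed.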

1.5.3 Incorrect Measures of Efficiency of a Data Model
Let’s look at the various measures of efficiency that fail at analysing the correctness of the underlying data model.

Total/Net Accuracy: One approach to gauge the model’s correctness is to use accuracy as the deciding parameter. But, as stated earlier, in a highly skewed data set like this, even if we predict all values as non-fraudulent we will have only 492 wrong predictions out of 284,807 in total. So, the accuracy is excellent, but it still doesn’t solve our problem, as we want to identify as many fraud cases as possible. So we can’t use accuracy as a deciding factor here.

Confusion Matrix: Merely tabulating the confusion matrix will not provide a clear understanding of the performance of the model. Because the total number of fraud cases is so small, variation in the confusion matrix will be so slight that it will be equivalent to a justified error in a balanced dataset (probably even less). So this measure is also ruled out.
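The accuracy trap described above can be checked with one line of arithmetic, using the dataset sizes quoted in the text (284,807 transactions, 492 frauds):

```python
# A degenerate "model" that predicts every transaction as non-fraud is
# right on all 284,315 genuine transactions and wrong on all 492 frauds,
# yet its accuracy still looks excellent.
total, frauds = 284_807, 492
accuracy_all_negative = (total - frauds) / total
print(f"{accuracy_all_negative:.4%}")  # roughly 99.8%, with zero frauds caught
```

Its recall on the fraud class, however, is exactly zero, which is why sensitivity and precision are the deciding metrics here.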

1.6 Objectives of Credit Card Fraud Detection

The objectives of credit card fraud detection are to reduce losses due to payment fraud for both merchants and issuing banks and to increase revenue opportunities for merchants. Credit card fraud detection is a challenging task. Online payment does not require a physical card, so anyone who knows the details of a card can make transactions. Currently, the card holder comes to know of a fraudulent transaction only after it has been carried out; no proper mechanisms are in place to track fraudulent transactions. The overall objectives of this project are:
➢ To reduce the number of fraudulent transactions.
➢ To make credit cards safe to use for online transactions.
➢ To add a layer of security.

Banks program their computers to monitor charges. While the exact things that will trigger a fraud alert are probably secret, a few can be identified from experience. They build a profile of you from your past charges. If your charges are all less than $100 and at places within a few miles of your billing address, then a $1,000 charge from the other side of the country will look suspicious and probably will be declined; then you will get a phone call or text asking if that was you. In fact, a much larger than usual charge in your home town may also get declined, especially if you only recently got the card, since the bank has not had enough time and charges to build the profile. Bear this in mind if you are going to do something new and different with your card, like travel, especially out of the country. If you are going to make purchases that are much larger than you have spent in the past, expect the first one to be declined until you satisfy the bank that it is you. You can avoid some of this by calling the bank before you travel, or before making that expensive purchase, to let them know.


1.7 Scope of Credit Card Fraud Detection

We designed a system to detect fraud in credit card transactions. This system is capable of providing most of the essential features required to distinguish fraudulent from legitimate transactions. As technology changes, it becomes difficult to track the behaviour and patterns of fraudulent transactions. We have only detected fraudulent activity; we have not prevented it. Preventing known and unknown fraud in real time is not easy, but it is feasible. The proposed architecture is basically designed to detect credit card fraud in online payments, and emphasis is placed on providing a fraud prevention system to verify a transaction as fraudulent or legitimate. For implementation purposes it is assumed that the issuer and acquirer banks are connected to each other. If this system is to be implemented in a real-time scenario, then exchange of best practices and raising consumer awareness can be very helpful in reducing the losses caused by fraudulent transactions. Further enhancement can be achieved by securing the system with certificates for both merchant and customer, and, as technology changes, new checks can be added to understand the pattern of fraudulent transactions and to alert the respective card holders and bankers when fraudulent activity is identified. The dataset available from day-to-day processing may become outdated, so it is necessary to have updated data for effective identification of fraud behaviour. To this extent, an incremental approach is necessary to make the system learn from past as well as present data and be capable of handling both. Fraudsters use new techniques that grow instantaneously along with new technology, making detection difficult. Also, the nature of access patterns may vary from one geographical location to another (such as urban and rural areas), which may result in false positive detections. In such cases, a future enhancement based on multiple models with varying access patterns needs attention to improve effectiveness. Privacy-preserving techniques applied in a distributed environment resolve the security-related issues by preventing private data access.

1.8 Organization of Credit Card Fraud Detection
From the moment payment systems came into existence, there have always been people finding new ways to access someone's finances illegally. This has become a major problem in the modern era, as transactions can easily be completed online by entering only the credit card information. Even in the 2010s, many users of American retail websites were victims of online transaction fraud, before two-step verification became common for online shopping. Organizations, consumers, banks, and merchants are all put at risk when a data breach leads to monetary theft and, ultimately, the loss of customer loyalty and company reputation. Unauthorized card operations hit an astonishing 16.7 million victims in 2017. Additionally, as reported by the Federal Trade Commission (FTC), the number of credit card fraud claims in 2017 was


40% higher than the previous year's figure. There were around 13,000 reported credit card fraud cases in California and 8,000 in Florida, the largest states per capita for this type of crime. The amount of money at stake was projected to exceed $30 billion by 2020. Some of the most common credit card fraud statistics are shown below:

Fig:- Credit Card Fraud Statistics

1.8.1 What is Credit Card Fraud Detection
Fraud detection is a set of activities undertaken to prevent money or property from being obtained through false pretenses. Fraud can be committed in different ways and in many industries. The majority of detection methods combine a variety of fraud detection datasets to form a connected overview of both valid and invalid payment data on which to base a decision. This decision must consider IP address, geolocation, device identification, BIN data, global latitude/longitude, historic transaction patterns, and the actual transaction information. In practice, this means that merchants and issuers deploy analytically based responses that use internal and external data to apply a set of business rules or analytical algorithms to detect fraud.

Credit card fraud detection with machine learning is a process of data investigation by a data science team and the development of a model that provides the best results in revealing and preventing fraudulent transactions. This is achieved by bringing together all meaningful features of card users' transactions, such as date, user zone, product category, amount, provider, and the client's behavioural patterns. The information is then run through a suitably trained model that finds patterns and rules so that it can classify whether a transaction is fraudulent or legitimate. Now that we know what fraud protection is, let us look at the most common types of threats.

1.8.2 The Major Types of Credit Card Fraud
Business fraud protection is a very significant issue in many industries. The number of reported cases shows the top-level importance of fraud protection for credit cards.

Rank | Category                                         | # of Reports
-----|--------------------------------------------------|-------------
 1   | Internet Services                                | 62,942
 2   | Credit Cards                                     | 51,129
 3   | Healthcare                                       | 47,410
 4   | Television and Electronic Media                  | 38,336
 5   | Foreign Money Offers and Counterfeit Check Scams | 27,443
 6   | Computer Equipment and Software                  | 18,350
 7   | Investment-Related                               | 14,884

1.8.3 Clone Transactions
Clone transactions are common among the different types of credit card fraud. Cloning simply means making a transaction similar to an original one, i.e. duplicating a transaction. This can happen, for example, when an organization tries to obtain payment from a partner multiple times by sending the same invoice to different departments.

The conventional rule-based fraud detection algorithm does not work well at distinguishing a fraudulent transaction from an irregular or mistaken one. For instance, a user could click the submission button twice by accident or order the same product twice. A better option is a system capable of differentiating a fraudulent transaction from one made in error. Here, machine learning fraud detection methods are more potent at differentiating clone transactions caused by human error from real fraud.
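As a minimal sketch of spotting exact clone transactions with pandas (the column names and records below are hypothetical illustrations, not data from this project):

```python
import pandas as pd

# Hypothetical invoice/transaction records; the last row repeats the first
tx = pd.DataFrame({
    "invoice_id": ["A1", "A2", "A3", "A1"],
    "partner":    ["X",  "Y",  "Z",  "X"],
    "amount":     [100,  250,  80,   100],
})

# Rows identical to an earlier row are candidate clone transactions
clones = tx[tx.duplicated(keep="first")]
print(clones)
```

A real system would combine such duplicate checks with behavioural features, since, as noted above, exact matching alone cannot tell an accidental double submission from deliberate fraud.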

1.8.4 Account Theft and Suspicious Transactions
When an individual's personal information, such as a Social Security number, a secret question answer, or a date of birth, is stolen by criminals, they can use this information to perform financial operations. Many fraudulent transactions are linked to identity theft, so financial fraud detection systems should pay the most attention to analysing a user's behaviour. There is often a certain regularity in the way a client makes payments: for example, someone visits a certain bar once a week at the same time and always spends about $40 to $60. If the same account is used to make a payment at a bar located in another part of town, and for a sum of more than $60, this behaviour would be considered irregular. The next move would be to send a verification request to the card owner in order to validate that he or she made the transaction. Metrics such as standard deviation, averages, and high/low values are the most useful for spotting irregular behaviour. Individual payments are compared with personal benchmarks to identify transactions with a high standard deviation, and the best choice is to verify with the account holder when such a deviation occurs.
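The deviation check described above can be sketched as a simple z-score rule; the 3-standard-deviation threshold and the sample amounts are illustrative assumptions, not values from this project:

```python
from statistics import mean, stdev

def is_irregular(history, new_amount, threshold=3.0):
    """Flag a payment whose amount deviates from the client's
    personal benchmark by more than `threshold` standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return new_amount != mu
    z = abs(new_amount - mu) / sigma
    return z > threshold

# Hypothetical client who usually spends $40-$60 at the same bar
history = [42, 55, 48, 60, 41, 52, 46, 58]
print(is_irregular(history, 50))    # typical amount -> False
print(is_irregular(history, 450))   # far above the personal benchmark -> True
```

A flagged payment would then trigger the verification request to the card owner described above.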

1.8.5 False Application Fraud
Credit card application fraud is often accompanied by account, identity, or credit card theft. It means that someone applies for a new credit account or credit card in another person's name. First, criminals steal the documents that will serve as supporting evidence for their fake application. Anomaly detection helps to identify whether a transaction has any unusual patterns, such as its date and time or the number of goods. If the algorithm spots such unusual behaviour, the owner of the bank account is protected by a few verification methods. This is a valuable method for credit fraud prevention.

1.8.6 Credit Card Skimming (Electronic or Manual)
Credit card skimming, or credit card forgery, means making an illegal copy of a credit or bank card with a device that reads and duplicates information from the original card. Credit card scammers use devices called "skimmers" to extract card numbers and other credit card information, save it, and resell it to criminals. As in the case of identity theft, suspicious transactions made with an electronically or manually copied card can be revealed from the information in the transaction: classification techniques can determine whether a transaction is fraudulent based on hardware, geolocation, and information about a client's behaviour patterns.

1.8.7 Account Takeover
Last but not least among the types of credit card fraud is account takeover. Fraudsters send deceptive emails to cardholders. The messages look quite legitimate (e.g. very similar bank URLs and trustworthy logos), as if they were sent by the bank. In reality, such a message can be used to steal someone's personal information, bank account numbers, and online passwords. If the victim clicks the wrong link or provides valuable information in response to a message from a fake bank website, within a couple of hours the bank account can be drained by the criminals into an account they hold. To counter this, AI-driven solutions rely on neural networks and pattern recognition: neural networks can learn suspicious-looking patterns and detect classes and clusters, then use these patterns for fraud detection.

1.8.8 How Does Credit Card Fraud Happen?
Credit card fraud is usually caused either by the card owner's negligence with his or her data or by a breach in a website's security. Here are some examples:

• A consumer reveals his or her credit card number to unfamiliar individuals.
• A card is lost or stolen and someone else uses it.
• Mail is stolen from the intended recipient and used by criminals.
• Business employees copy cards or card numbers of their owners.
• Criminals make counterfeit credit cards.

Fig:- No. of Worldwide Non-Cash Transactions

When a card is lost or stolen, an unauthorized charge can happen; in other words, the person who finds it uses it for a purchase. Criminals can also forge the owner's name and use the card, or order goods through a mobile phone or computer. There is also the problem of counterfeit credit cards: fake cards carrying real account information stolen from holders. These are especially dangerous because the victims still have their real cards but do not know that someone has copied them. Such fraudulent cards look quite legitimate, with the logos and encoded magnetic strips of the original. Fraudulent cards are usually destroyed by the criminals after several successful payments, just before the victim realizes the problem and reports it.

1.8.9 Credit Card Fraud Detection Systems
Fraud detection systems typically combine:

• Off-the-shelf fraud risk scores pulled from third parties.
• Predictive machine learning models that learn from prior data and estimate the probability of a fraudulent credit card transaction.
• Business rules that set conditions a transaction must pass to be approved (e.g., no OFAC alert, SSN matches, below deposit/withdrawal limit, etc.).

Among these fraud analytics techniques, predictive machine learning models belong to the smart Internet security solutions.
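The business-rule layer above can be sketched as a chain of simple checks; the rule names, limits, and transaction fields here are hypothetical illustrations, not a real issuer's rule set:

```python
# Minimal rule-based screening sketch. Field names and limits are
# illustrative assumptions, not a real issuer's rules.
RULES = [
    ("withdrawal limit", lambda t: t["amount"] <= 5000),
    ("country allowed",  lambda t: t["country"] not in {"XX"}),
    ("card not expired", lambda t: not t["expired"]),
]

def screen(transaction):
    """Return the list of rule names the transaction fails."""
    return [name for name, ok in RULES if not ok(transaction)]

tx = {"amount": 7200, "country": "US", "expired": False}
print(screen(tx))  # -> ['withdrawal limit']
```

In practice such static rules are only the first line of defence; the predictive models discussed above handle the patterns rules cannot express.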


CHAPTER- 02 LITERATURE SURVEY/ SYSTEM DESIGN
2.1 Literature Survey
Fraud is an unlawful or criminal deception intended to result in financial or personal benefit. It is a deliberate act against law, rule, or policy with the aim of attaining unauthorized financial benefit. Numerous publications on anomaly and fraud detection in this domain are already available for public use. A comprehensive survey conducted by Clifton Phua and his associates revealed that techniques employed in this domain include data mining applications, automated fraud detection, and adversarial detection. In another paper, Suman (Research Scholar, GJUS&T, Hisar) presented supervised and unsupervised learning techniques for credit card fraud detection. Even though these methods and algorithms achieved unexpected success in some areas, they failed to provide a permanent and consistent solution to fraud detection. A similar research direction was presented by WenFang Yu and Na Wang, who used outlier mining, outlier detection mining, and distance-sum algorithms to accurately predict fraudulent transactions in an emulation experiment on the credit card transaction data set of a certain commercial bank. Outlier mining is a field of data mining used chiefly in the monetary and Internet fields. It deals with detecting objects that are detached from the main system, i.e. transactions that are not genuine. The authors took attributes of customer behaviour and, based on the values of those attributes, calculated the distance between the observed value of each attribute and its predetermined value.
Unconventional techniques such as hybrid data mining/complex network classification algorithms are able to perceive illegal instances in an actual card transaction data set. Based on a network reconstruction algorithm that creates representations of the deviation of one instance from a reference group, they have proved efficient, typically on medium-sized online transactions. There have also been efforts to progress from a completely new angle: attempts have been made to improve the alert-feedback interaction, so that in the case of a fraudulent transaction the authorised system is alerted and feedback is sent to deny the ongoing transaction. The artificial genetic algorithm, one of the approaches that shed new light on this domain, countered fraud from a different direction. It proved accurate in finding fraudulent transactions and minimizing the number of false alerts, even though it was accompanied by a classification problem with variable misclassification costs.


2.2 System Design 2.2.1 System Architecture

Fig:- System Architecture of CCFDS

The figure above shows the process of the CCFDS. The system model accepts a real-time customer credit card transaction database; from this, the fraud rate of credit card transactions is determined.

2.2.2 Data Collection
The input dataset is collected based on transaction details.

2.2.3 Data Balancing
After collecting the large set of records, it is necessary to understand the class imbalance and separate the data into two classes: class 0 indicates non-fraud and class 1 indicates fraud. Because frauds are rare, these classes are heavily unbalanced.
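As a sketch of this balancing step (with a tiny synthetic label column standing in for the real dataset), the class counts can be inspected and the majority class randomly undersampled:

```python
import pandas as pd

# Tiny synthetic stand-in for the real Class column (0 = non-fraud, 1 = fraud)
df = pd.DataFrame({"Class": [0] * 95 + [1] * 5})
print(df["Class"].value_counts())  # heavily unbalanced: 95 vs 5

# Random undersampling of the majority class to match the minority size
fraud = df[df["Class"] == 1]
legit = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, legit])
print(balanced["Class"].value_counts())  # now 5 vs 5
```

Undersampling is only one option; oversampling the minority class or using class weights in the classifier are common alternatives.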

2.2.4 Feature Extraction and Selection
Class 1 (fraud) contains a total of 492 samples. In this project the features V1, V2, ..., V28 are used, together with Time and Amount.

2.2.5 Outlier Detection
A clustering technique measures the distance from each data point to its similar data; values that do not follow the trained data are considered outliers.

2.2.6 Classification
As the dataset is imbalanced, many classifiers show bias towards the majority class. The PySpark library is applied for SQL-like analysis of large amounts of structured or semi-structured data, and a GBT (gradient-boosted tree) classifier performs the classification of data arriving through the stream.
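The gradient-boosted tree classification step can be sketched as follows; for a self-contained illustration this uses scikit-learn's GradientBoostingClassifier on synthetic data, rather than the PySpark GBTClassifier and streaming setup described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the transaction data (~2% "fraud")
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.98],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

gbt = GradientBoostingClassifier(random_state=42)
gbt.fit(X_train, y_train)
print("test accuracy:", gbt.score(X_test, y_test))
```

Note that on imbalanced data raw accuracy is misleading (predicting "non-fraud" everywhere already scores ~98%), so precision and recall on the fraud class are the metrics that matter.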


CHAPTER- 03 PROPOSED WORK
3.1 Proposed System
The key objective of the current research is to improve the procedure of personal follow-up on a large number of suspicious transactions and to discover a way to preprocess the flagged records so as to recognize the probably genuine entries in the list of genuine/falsified entries. In this way the volume of needless analysis is decreased, leading to significant savings for financial institutions. Moreover, the current FDS threshold can be lowered, so that fraudulent cases being missed at the present level can be detected. As a result, fraud is discovered earlier and the overall losses may be reduced. To address these challenges, outlier detection and a GBT classifier are used, two of the most commonly used machine learning approaches for pattern recognition and classification problems. The results indicate that the method used has a very good possibility of improving the present system.

3.1.1 Advantages
Following are the advantages of the proposed system:

• The proposed method overcomes the low-accuracy forecast problem.
• Utilizing the latest AI methods, fraudulent transactions are recognized and false alerts are reduced.
• A fast and reliable solution is attained.

3.2 Functional Requirements
Functional requirements are the characteristics of the product. All the features expected from the development are mentioned as functional requirements.

3.3 Non-Functional Requirements
Non-functional requirements list the client's expectations from the viewpoints of product design, security, accessibility, reliability, and performance.

3.3.1 Performance Requirements
Performance requirements describe the software's ability to respond to user actions, such as:

➢ The application should launch within 3 seconds.
➢ Data validation should not take more than 5 seconds.
➢ Result generation should be achieved within 5 seconds.

Design Constraints- The project is to be developed in Python and should execute on a Windows OS, with the PyCharm editor used as the IDE.
Standards Compliance- There should be uniformity in defining variable names. The GUI shall have a pleasant look and feel and should be user friendly.
Reliability- The product should not fail in the middle of any operation being carried out.
Availability- The software can be used at any time.
Security- Security is very important for any application that holds sensitive user data.
Maintainability- The software admin should be able to manage the data.
Portability- The project should be executable on any Windows OS.

3.4 Related Work
A. Shen et al. (2007) demonstrated the efficiency of classification models on the credit card fraud detection problem, proposing three classification models: decision tree, neural network, and logistic regression. Among the three, the neural network and logistic regression outperformed the decision tree. M. J. Islam et al. (2007) proposed a probability theory framework for making decisions under uncertainty; after reviewing Bayesian theory, a naïve Bayes classifier and a k-nearest neighbour classifier were implemented and applied to a credit card dataset. Y. Sahin and E. Duman (2011) studied credit card fraud detection using seven classification methods, including decision trees and SVMs, to decrease the risk to banks. They suggested that artificial neural network and logistic regression classification models are more helpful for improving fraud detection performance, and in a further study explained that ANN classifiers outperform LR classifiers in solving the problem under investigation.


In these studies, as the distribution of the training data sets became more biased, the efficiency of all models in catching the fraudulent transactions decreased.

3.5 Proposed Technique
The proposed techniques are used in this work for detecting frauds in the credit card system. Comparisons are made between different machine learning algorithms, such as logistic regression, decision trees, and random forest, to determine which algorithm suits best and can be adopted by credit card merchants for identifying fraudulent transactions. Figure 1 shows the architectural diagram representing the overall system framework.


CHAPTER- 04 METHODOLOGY
4.1 Method
The approach proposed in this work uses the latest machine learning algorithms to detect anomalous activities, called outliers. The basic rough architecture can be represented by the following figure.

Fig:- Basic Outline Architecture
When looked at in detail on a larger scale along with real-life elements, the full architecture diagram can be represented as follows:

Fig:- Full Architectural Diagram

First of all, we obtained our dataset from Kaggle, a data analysis website that provides datasets. This dataset has 31 columns, of which 28 are named V1-V28 to protect sensitive data. The other columns represent Time, Amount and Class. Time shows the gap between the first transaction and the following one; Amount is the amount of money transacted; and Class is 0 for a valid transaction and 1 for a fraudulent one. We plot different graphs to check for inconsistencies in the dataset and to comprehend it visually.
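Loading and inspecting the dataset can be sketched as below; in the real project the Kaggle CSV would be read directly, while here a tiny synthetic frame with the same column layout stands in for it so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# In the real project: df = pd.read_csv("creditcard.csv")
# Synthetic stand-in with the same layout (Time, V1..V28, Amount, Class)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 28)),
                  columns=[f"V{i}" for i in range(1, 29)])
df.insert(0, "Time", np.arange(100.0))
df["Amount"] = rng.uniform(1, 500, size=100).round(2)
df["Class"] = rng.choice([0, 1], size=100, p=[0.98, 0.02])

print(df.shape)                    # 31 columns, as described above
print(df["Class"].value_counts())  # fraud cases are a tiny minority
```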

Fig:- No. of Fraudulent and Non-Fraudulent Cases
This graph shows that the number of fraudulent transactions is much lower than that of legitimate ones.

Fig:- Graphical Representation of Transaction (Time Feature)
This graph shows the times at which transactions were made within two days. It can be seen that the fewest transactions were made during the night and the most during the day.

Fig:- Graphical Representation of Transaction (Monetary Value Feature)

This graph represents the amounts transacted. The majority of transactions are relatively small, and only a handful come close to the maximum transacted amount. After checking the dataset, we plot a histogram for every column. This gives a graphical representation of the dataset that can be used to verify that there are no missing values, ensuring that no missing-value imputation is required and that the machine learning algorithms can process the dataset smoothly.
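The missing-value check can be sketched as follows (a small synthetic frame is used here so the snippet is self-contained; the real project would run this on the full dataset):

```python
import pandas as pd

df = pd.DataFrame({"Time": [0.0, 1.0, 2.0],
                   "Amount": [10.5, 99.0, 3.2],
                   "Class": [0, 0, 1]})

# Per-column histograms would be drawn in a notebook, e.g.:
# df.hist(figsize=(20, 20))

# Verifying that no column contains missing values
print(df.isnull().sum())          # zero for every column
print(df.isnull().values.any())   # False -> no imputation needed
```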

Fig:- Processes of CCFD
After this analysis, we plot a heatmap to get a coloured representation of the data and to study the correlation between our predictor variables and the class variable.

The dataset is now formatted and processed. The Time and Amount columns are standardized, and the Class column is removed to ensure fair evaluation. The data is processed by a set of algorithms from library modules; the following module diagram explains how these algorithms work together. The data is fitted into a model and the following outlier detection modules are applied to it:

• Local Outlier Factor
• Isolation Forest

These algorithms are part of sklearn (scikit-learn). The ensemble module in the sklearn package includes ensemble-based methods and functions for classification, regression and outlier detection. This free and open-source Python library is built on the NumPy, SciPy and matplotlib modules and provides many simple and efficient tools for data analysis and machine learning. It features various classification, clustering and regression algorithms and is designed to interoperate with the numerical and scientific libraries. We used the Jupyter Notebook platform to write a Python program demonstrating the approach this work suggests; the program can also be executed in the cloud using the Google Colab platform, which supports all Python notebook files. Detailed explanations of the modules, with pseudocode for their algorithms and output graphs, follow.

4.1.1 Local Outlier Factor
The Local Outlier Factor (LOF) is an unsupervised outlier detection algorithm; the term refers to the anomaly score of each sample. It measures the local deviation in density of a sample with respect to its neighbours. More precisely, locality is given by the k-nearest neighbours, whose distances are used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbours, one can identify samples with a substantially lower density than their neighbours. These samples are quite anomalous and are considered outliers. As the dataset is very large, we used only a fraction of it in our tests to reduce processing times; the final result with the complete dataset processed is given in the results section.
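A minimal sketch of the LOF step on synthetic data (the parameter values and sample sizes are illustrative, not those used in this project):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# 200 "normal" points around the origin plus 5 obvious outliers far away
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(8, 10, size=(5, 2))])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)   # -1 = outlier, 1 = inlier
print("detected outliers:", int((labels == -1).sum()))
```

The five distant points have a much lower local density than their neighbours, so LOF assigns them a high anomaly score and labels them -1.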

4.1.2 Isolation Forest Algorithm
The Isolation Forest "isolates" observations by arbitrarily selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. Since this recursive partitioning can be represented by a tree, the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node. The average of this path length over a forest of random trees gives a measure of normality and is the basis of the decision function. Random partitioning produces noticeably shorter paths for anomalies, so when a forest of random trees collectively produces shorter path lengths for specific samples, those samples are highly likely to be anomalies. Once the anomalies are detected, the system can report them to the concerned authorities. For testing purposes, we compare the outputs of these algorithms to determine their accuracy and precision.
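The same synthetic setup can be run through scikit-learn's IsolationForest; the contamination value below is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # normal points
               rng.uniform(8, 10, size=(5, 2))])  # obvious anomalies

iso = IsolationForest(contamination=0.025, random_state=42)
labels = iso.fit_predict(X)   # -1 = anomaly, 1 = normal
print("detected anomalies:", int((labels == -1).sum()))
```

The distant points are isolated in very few random splits, so their short average path lengths mark them as anomalies.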

4.2 Algorithm
4.2.1 Logistic Regression
Logistic regression is a classification algorithm used to predict a binary outcome (1/0, Yes/No, True/False) from a given set of independent variables. Dummy variables are used to represent binary/categorical values. Logistic regression can be seen as a special case of linear regression in which the outcome variable is categorical: the log of odds is used as the dependent variable, and the model predicts the probability of occurrence of an event by fitting the data to a logistic function:

O = e^(I0 + I1*x) / (1 + e^(I0 + I1*x))    (3.1)

where O is the predicted output, I0 is the bias or intercept term, and I1 is the coefficient for the single input value x. Logistic regression is a regression model in which the dependent variable is categorical; it analyzes the relationship between multiple independent variables. There are several types of logistic regression model, such as the binary logistic model, the multiple logistic model, and the binomial logistic model. The binary logistic regression model is used to estimate the probability of a binary response based on one or more predictors.

Fig:- Logistic Curve

This graph shows the difference between linear regression and logistic regression: logistic regression yields an S-shaped curve, whereas linear regression is a straight line.
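Equation (3.1) can be evaluated directly; the coefficient values below are arbitrary illustrations:

```python
import math

def logistic(x, i0=0.0, i1=1.0):
    """Equation (3.1): O = e^(I0 + I1*x) / (1 + e^(I0 + I1*x))."""
    z = i0 + i1 * x
    return math.exp(z) / (1.0 + math.exp(z))

print(logistic(0.0))    # 0.5 -> the midpoint of the S-curve
print(logistic(6.0))    # close to 1
print(logistic(-6.0))   # close to 0
```

The output is always between 0 and 1, which is why it can be interpreted as the probability of the positive class.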


4.2.2 SVM Model (Support Vector Machine)
SVM is one of the popular machine learning algorithms for regression and classification. It is a supervised learning algorithm that analyses data for classification and regression. SVM modelling involves two steps: first, training on a data set to obtain a model, and then using this model to predict the labels of a testing data set. A Support Vector Machine is a discriminative classifier formally defined by a separating hyperplane: the SVM model represents the training data as points in space, mapped so that points of different classes are divided by a gap that is as wide as possible. New data points are mapped into the same space and assigned a class based on which side of the gap they fall.

Fig:- SVM Model Graph

In the SVM algorithm, each data item is plotted as a point in n-dimensional space, where n is the number of features and the value of each feature is the value of a particular coordinate. Classification is then performed by locating the hyperplane that best separates the two classes.
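A minimal SVC sketch on synthetic two-class data (the linear kernel and data-generation parameters are illustrative assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for the two classes
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=42)

clf = SVC(kernel="linear")
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```

Only the support vectors (the points closest to the separating hyperplane) determine the decision boundary; the remaining points could be removed without changing the model.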


4.2.3 Decision Tree Algorithm
A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.

Fig:- Decision Tree

4.2.3.1 Types of Decision Trees
1. Categorical Variable Decision Tree: a decision tree with a categorical target variable.
2. Continuous Variable Decision Tree: a decision tree with a continuous target variable.

4.2.3.2 Terminology of Decision Trees
1. Root Node: represents the entire population or sample; it is further divided into two or more homogeneous sets.
2. Splitting: the process of dividing a node into two or more sub-nodes.
3. Decision Node: a sub-node that splits into further sub-nodes.
4. Leaf/Terminal Node: a node that does not split.
5. Pruning: removing sub-nodes of a decision node; it can be described as the opposite of splitting.
6. Branch/Sub-Tree: a sub-section of the entire tree.
7. Parent and Child Node: a node that is divided into sub-nodes is the parent node of those sub-nodes, which in turn are its children.

4.2.3.3 Working of Decision Trees
Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of a node increases with respect to the target variable. The decision tree considers splits on all available variables and then selects the split that results in the most homogeneous sub-nodes, using criteria such as:

1. Gini Index
2. Information Gain
3. Chi-Square
4. Reduction in Variance
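The Gini index named above can be computed directly from the class proportions in a node; the example labels are illustrative:

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 0, 0]))   # pure node -> 0.0
print(gini([0, 0, 1, 1]))   # evenly mixed binary node -> 0.5
```

A split is chosen to minimize the weighted Gini impurity of the resulting sub-nodes, i.e. to maximize the purity gain.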

4.2.4 Random Forest
Random forest is a tree-based algorithm that builds several trees and combines their outputs to improve the generalization ability of the model. This method of combining trees is known as an ensemble method: ensembling combines weak learners (the individual trees) to produce a strong learner. Random forest can be used to solve both regression problems, where the dependent variable is continuous, and classification problems, where the dependent variable is categorical.
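A minimal random-forest sketch with an out-of-bag error estimate, which corresponds to the bagging procedure described in the working subsection (the parameters and synthetic data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# oob_score=True evaluates each tree on the rows left out of its
# bootstrap sample, giving an unbiased estimate of generalization error.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)
print("out-of-bag score:", rf.oob_score_)
```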

4.2.4.1 Working of Random Forest
A bagging algorithm is used to create random samples. Given a data set D1 with n rows and m columns, a new data set D2 is created by sampling n cases at random with replacement from the original data. About 1/3rd of the rows of D1 are left out of each sample; these are known as Out-of-Bag samples. The model is then trained on the new dataset D2, and the Out-of-Bag samples are used to determine an unbiased estimate of the error. Out of m columns, M