Confusion Matrix in Cyber Crime

Prakhar Lad
5 min read · Jun 5, 2021

What is a Confusion Matrix?
A confusion matrix is a table used in machine learning and related engineering fields. It helps to show the precision and recall of a system where the true values of the test data are known.

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

For a binary classification problem, we would have a 2 x 2 matrix with 4 values: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

Let’s decipher the Confusion Matrix:

True Positive (TP)

  • The predicted value matches the actual value
  • The actual value was positive and the model predicted a positive value

True Negative (TN)

  • The predicted value matches the actual value
  • The actual value was negative and the model predicted a negative value

False Positive (FP) — Type 1 error

  • The predicted value does not match the actual value
  • The actual value was negative but the model predicted a positive value
  • Also known as the Type 1 error

False Negative (FN) — Type 2 error

  • The predicted value does not match the actual value
  • The actual value was positive but the model predicted a negative value
  • Also known as the Type 2 error
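The four cells described above can be counted directly from a list of actual and predicted labels. The following is only a minimal sketch: the function name `confusion_counts` and the sample labels (1 = malicious, 0 = benign) are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1  # predicted negative, actually negative
        elif p == positive and a != positive:
            fp += 1  # Type 1 error
        else:
            fn += 1  # Type 2 error
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels: 1 = malicious, 0 = benign
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```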

The four values of the confusion matrix are used to calculate accuracy, precision, recall and the F1 score.

Accuracy is the most common performance measure used to check a model. It is the share of all predictions that are correct. It is a reasonable measure when the classes are balanced and false positives and false negatives matter about equally. When the class distribution is uneven, or the two kinds of error have very different costs, accuracy can be misleading, and it is better to use the F1 score instead. Accuracy can be calculated using:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
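To see why accuracy alone can mislead, here is a small sketch with invented numbers: 1,000 events of which only 10 are real attacks, and a "model" that labels everything as benign. The helper name `accuracy` and the counts are assumptions made for this example.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical case: 990 benign events, 10 attacks, everything predicted benign.
always_benign = accuracy(tp=0, tn=990, fp=0, fn=10)
print(always_benign)  # 0.99, although not a single attack was caught
```

The score looks excellent even though every attack was missed, which is why precision, recall and the F1 score are worth checking as well.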

Precision and Recall

Precision and recall metrics take the classification accuracy one step further and allow us to get a more specific understanding of model evaluation. Which one to prefer depends on the task and what we aim to achieve.

Precision measures how good our model is when the prediction is positive. It is the ratio of correct positive predictions to all positive predictions:

Precision = TP / (TP + FP)

Recall measures how good our model is at finding the actual positives. It is the ratio of correct positive predictions to all actual positives:

Recall = TP / (TP + FN)

The focus of precision is positive predictions, so it indicates how many of the positive predictions are true. The focus of recall is the actual positives, so it indicates how many of them the model can predict correctly.
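As a quick illustration of the two ratios, here is a minimal sketch. The counts (30 true positives, 10 false positives, 20 false negatives) are invented for the example.

```python
def precision(tp, fp):
    """Correct positive predictions divided by all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Correct positive predictions divided by all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 30 attacks caught, 10 false alarms, 20 attacks missed
p = precision(tp=30, fp=10)
r = recall(tp=30, fn=20)
print(p, r)  # 0.75 0.6
```

Here 75% of the alerts were real attacks, but only 60% of the real attacks raised an alert.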

Another measure combines precision and recall into a single number, and that is the F1 score.

F1 Score

F1 score is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

F1 score is a more useful measure than accuracy for problems with uneven class distribution because it takes into account both false positives and false negatives.

The best value for the f1 score is 1 and the worst is 0.
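The F1 score is commonly defined as the harmonic mean of precision and recall, so it stays low whenever either of the two is low. A minimal sketch with made-up inputs:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.75, 0.6)  # both values are moderate
lopsided = f1_score(1.0, 0.1)   # perfect precision, very poor recall
plain_mean = (1.0 + 0.1) / 2    # a simple average would hide the poor recall

print(round(balanced, 3), round(lopsided, 3), plain_mean)
```

In the lopsided case the simple average is 0.55, while the F1 score is below 0.2, which reflects the missed positives much better.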

How does the confusion matrix come into play in cyber crime?

At present, no generalized framework is available to categorize cybercrime offences by extracting features from the cases. In the present work, data analysis and machine learning are combined to build a cybercrime detection and analytics system. The proposed system’s design and implementation use classification (supervised) and clustering algorithms. Here, naïve Bayes is used for classification and k-means is used for clustering. For feature extraction in the proposed work, the TF-IDF vectorization process is used. The methodology is based on 4 phases that are applied to the data: reconnaissance, preprocessing, data clustering and classification, and prediction analysis.

Preprocessing, Clustering and Classification

In this phase, the feature extraction process takes place. It converts high-dimensional data into low-dimensional data. The preprocessed data are also helpful for data visualization, because complex data are easier to organize once they are reduced to a smaller number of dimensions. For feature extraction in the proposed work, the TF-IDF vectorization process is used. It evaluates the unigrams and bigrams of each cybercrime case. One of the best approaches to feature extraction is the bag-of-words model, which records the presence of the words in each case description but not the order in which they occur.
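To make the TF-IDF step more concrete, here is a small pure-Python sketch over unigrams and bigrams. It uses one common TF-IDF variant (term frequency times log(N / document frequency)); the exact variant used in the work is not stated, and the function names and sample case descriptions are invented for illustration.

```python
import math
from collections import Counter


def ngrams(text):
    """Lowercase the text, split on whitespace, return unigrams plus bigrams."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def tfidf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [ngrams(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for terms in tokenized for term in set(terms))
    vectors = []
    for terms in tokenized:
        counts = Counter(terms)
        total = len(terms)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical case descriptions
reports = [
    "phishing email stole bank password",
    "fake bank email asked for password",
    "ransomware encrypted hospital files",
]
vectors = tfidf(reports)

# The bigram "bank password" occurs in one report only, so it is weighted
# higher than the unigram "bank", which occurs in two reports.
print(vectors[0]["bank password"] > vectors[0]["bank"])  # True
```

Word order inside a report is ignored apart from the adjacent pairs captured by the bigrams, which is the bag-of-words idea described above.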

Here, naïve Bayes is used for classification and k-means is used for clustering. The cybercrime offences are clustered based on the TF-IDF weighted vectors obtained from the features. The data were split using the 70:30 rule of thumb: 70% of the data were used for training and 30% for validation and testing. Using these features, the cybercrime incidents are categorized.
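The 70:30 split and the naïve Bayes classification step could look roughly like the sketch below. This is not the implementation from the work itself: it is a tiny multinomial naïve Bayes with Laplace smoothing on raw word counts, the helper names are hypothetical, the ten labelled case descriptions are invented, and the k-means clustering step is left out to keep the example short.

```python
import math
import random
from collections import Counter, defaultdict


def split_70_30(samples, seed=0):
    """Shuffle a copy of the samples and cut it into 70% train, 30% test."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]


class NaiveBayes:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, samples):
        self.class_counts = Counter(label for _, label in samples)
        self.word_counts = defaultdict(Counter)
        for text, label in samples:
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior of the class
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled case descriptions
samples = [
    ("fake bank email asked for password", "phishing"),
    ("email link stole login password", "phishing"),
    ("spoofed email requested card details", "phishing"),
    ("bank login page was a fake link", "phishing"),
    ("password reset email was fake", "phishing"),
    ("files encrypted and ransom demanded", "ransomware"),
    ("ransom note after server files locked", "ransomware"),
    ("hospital files encrypted by malware", "ransomware"),
    ("malware locked laptop and demanded payment", "ransomware"),
    ("attackers encrypted backups for ransom", "ransomware"),
]

train, test = split_70_30(samples)
model = NaiveBayes().fit(train)
predictions = [model.predict(text) for text, _ in test]
print(len(train), len(test), predictions)
```

The predictions on the 30% held-out part, compared with the true labels, are exactly what the confusion matrix is built from.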

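Here arrives the confusion matrix

The classification results are evaluated with the following measures, all derived from the confusion matrix:
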
  1. Precision
  2. Recall
  3. F1 Score
  4. Accuracy

With these measures, we know how many cases are classified correctly and how many are classified incorrectly. In other words, we can find the true positives, true negatives, false positives and false negatives produced by the model.
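Since cybercrime cases usually fall into more than two categories, the confusion matrix becomes N x N and the four measures are read per class. The sketch below shows one way this could be done; the category names, the labels and the function names are all invented for illustration and are not results from the work.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]


def per_class_report(matrix, labels):
    """Return per-class precision, recall and F1, plus overall accuracy."""
    n = len(labels)
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp  # rest of the column
        fn = sum(matrix[i]) - tp                       # rest of the row
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    accuracy = correct / total if total else 0.0
    return report, accuracy


# Hypothetical categories and labels
labels = ["phishing", "ransomware", "fraud"]
actual = ["phishing", "phishing", "phishing", "ransomware", "ransomware",
          "fraud", "fraud", "fraud", "fraud", "ransomware"]
predicted = ["phishing", "phishing", "fraud", "ransomware", "ransomware",
             "fraud", "fraud", "phishing", "fraud", "ransomware"]

matrix = confusion_matrix(actual, predicted, labels)
report, accuracy = per_class_report(matrix, labels)
print(matrix)    # [[2, 0, 1], [0, 3, 0], [1, 0, 3]]
print(accuracy)  # 0.8
```

The diagonal holds the correctly classified cases, and every off-diagonal cell shows which category was confused with which.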

Conclusion

The proposed framework is essential to creating a model that can support analytics for the identification, detection and classification of integrated cybercrime offences (structured and unstructured).

The aim is that the developed framework will provide essential broad knowledge of cybercrime offences to society, enable people to understand the threat landscape of such attacks and help prevent cybercrime offences. The developed framework reduces time consumption and manual reporting. It helps to identify the number of cases filed by incident type and by area. This report is useful for predicting cases and for taking precautionary steps against cybercrime in the hot-spot places identified.

Thank You for reading!!!
