Arabic-tweets-sentiment-analysis

Marwah Faraj

LinkedIn | GitHub | E-mail | Arabic tweets Dataset | Project Presentation

Table of Contents

  1. Background and Motivation

  2. Data

  3. Exploration

  4. Visualization

  5. Machine Learning

  6. Conclusions

  7. Further Study

Background and Motivation

This project implements sentiment analysis for the Arabic language. I use NLP (Natural Language Processing) techniques to prepare the text and machine learning algorithms to classify whether the sentiment of a tweet is positive or negative.


Data

This dataset consists of more than 56 thousand tweets in Arabic. It is divided into two parts, more than 28,500 positive tweets and more than 28,300 negative tweets, and it does not contain any null values.

Description

Arabic Script

  • Written right-to-left
  • Letters have contextual variants
  • Used to write many languages besides Arabic: Persian, Kurdish, Urdu, etc.

The main challenges for Arabic Language Processing:

  • Orthographic ambiguity الابهام الاملائي
  • Morphological richness الغنى الصرفي
  • Dialectal variation تعدد اللهجات
  • Orthographic inconsistency الاخطاء الاملائية
  • Resource poverty (data & tools) فقر موارد البيانات والادوات
  • Limited research البحث العلمي المحدود

Pipeline

The following software libraries and models were used to examine, plot, analyze, and classify this data: pandas, numpy, and NLTK; the machine learning models Multinomial Naive Bayes, Gaussian Naive Bayes, Ridge Classifier, Logistic Regression, and Random Forest; a deep learning Multilayer Perceptron built with TensorFlow and Keras; and matplotlib and seaborn.
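To make the classification step more concrete, here is a minimal sketch of Multinomial Naive Bayes written from scratch with only the standard library. It is not the author's code (the post does not show it, and the project used library implementations); the function names, the whitespace tokenizer, and the four toy tweets are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a word-count Naive Bayes model with add-alpha (Laplace) smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.split())  # naive whitespace tokenizer
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_like, vocab

def predict(model, text):
    """Return the class with the highest log posterior; unknown words are skipped."""
    log_prior, log_like, vocab = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in text.split() if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy tweets, not taken from the dataset.
docs = ["يوم جميل سعيد", "خبر سعيد جدا", "يوم سيء حزين", "خبر حزين جدا"]
labels = ["pos", "pos", "neg", "neg"]
model = train_multinomial_nb(docs, labels)
print(predict(model, "يوم سعيد"))
```

A real run would replace the toy lists with the cleaned tweets and their labels, and would hold out part of the data for evaluation.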


Exploration

The dataset info shows that the dataset does not contain null values, but it does contain a lot of emojis. Further exploration showed that it also contains English words and English numerals, and that the two classes are balanced.
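One possible way to deal with the emojis, English words, and numerals noted above is to keep only Arabic letters. The post does not say which cleaning steps were actually applied, so the sketch below is an assumption: `clean_tweet` is a hypothetical helper, and the character range U+0621 to U+064A covers the basic Arabic letters only (diacritics and Arabic-Indic digits are dropped as well).

```python
import re

# Anything that is not a basic Arabic letter or whitespace is replaced by a space.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]+")
WHITESPACE = re.compile(r"\s+")

def clean_tweet(text: str) -> str:
    """Strip emojis, Latin letters, digits, and punctuation from a tweet."""
    text = NON_ARABIC.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()

# Hypothetical example tweet, not taken from the dataset.
print(clean_tweet("يوم جميل 😀 happy 2021"))
```

Dropping emojis this way loses information that may carry sentiment, which is why emoji handling is listed under further study.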

Visualization

The data status: balanced data

Arabic speakers use the word ‘God’ a lot in conversation, and as shown, the most common word in the dataset is ‘God’.

For the same reason, the most common word in the negative tweets is also ‘God’.
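A word-frequency count like the one behind these plots can be sketched with `collections.Counter`. The helper name `most_common_words` and the toy tweets are assumptions for illustration; the post does not show how the counts were produced.

```python
from collections import Counter

def most_common_words(tweets, n=3):
    """Count whitespace-separated tokens across tweets and return the top n."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return counts.most_common(n)

# Hypothetical toy tweets, not taken from the dataset.
tweets = ["الله كريم", "الحمد لله", "الله اكبر"]
print(most_common_words(tweets, 1))
```

Passing only the negative (or only the positive) tweets gives the per-class ranking. Note that prefixed forms such as لله are counted as separate tokens from الله, one small instance of the morphological richness listed among the challenges.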


Machine Learning

Different machine learning algorithms were implemented and used to predict on an unseen dataset and classify new tweets. Their scores are detailed below:
1- Naive Bayes Algorithm:

  • Multinomial Naive Bayes Algorithm:
    Accuracy= 0.767
    Precision= 0.770
    Recall= 0.760
    F1= 0.770
  • Gaussian Naive Bayes Algorithm:
    Accuracy= 0.741
    Precision= 0.830
    Recall= 0.610
    F1= 0.700

2- Ridge Classifier:
Accuracy= 0.792
Precision= 0.810
Recall= 0.770
F1= 0.790

3- Logistic Regression Algorithm:
Accuracy= 0.790
Precision= 0.820
Recall= 0.750
F1= 0.780

4- Random Forest Algorithm:
Accuracy= 0.677
Precision= 0.780
Recall= 0.490
F1= 0.600
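The F1 values above can be sanity-checked against the listed precision and recall, since F1 is their harmonic mean. The sketch below is only a consistency check on the rounded figures; the helper `f1_score` is written here for illustration and is not the author's evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the scores listed above.
reported = {
    "Multinomial Naive Bayes": (0.770, 0.760, 0.770),
    "Gaussian Naive Bayes": (0.830, 0.610, 0.700),
    "Ridge Classifier": (0.810, 0.770, 0.790),
    "Logistic Regression": (0.820, 0.750, 0.780),
    "Random Forest": (0.780, 0.490, 0.600),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1:.3f}")
```

Four of the five models recompute to the reported value after rounding to two decimals. Multinomial Naive Bayes comes out at about 0.765, which is within what rounding of the inputs, or averaging the metrics over both classes, could explain.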

Deep Learning

1- Multilayer Perceptron:
Loss score= 0.4372
Accuracy= 0.804

The MLP model did not predict all the positive and negative tweets correctly. As the graph shows, 974 negative tweets were predicted as positive, and 1251 positive tweets were predicted as negative.
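These two error counts can be combined with the reported accuracy for a rough consistency check. The arithmetic below assumes that the 80.4% accuracy and the misclassification counts come from the same held-out set, which the post does not state explicitly.

```python
false_positives = 974    # negative tweets predicted as positive (from the text)
false_negatives = 1251   # positive tweets predicted as negative (from the text)
accuracy = 0.804         # reported MLP accuracy

# Every misclassified tweet is either a false positive or a false negative,
# so errors / (1 - accuracy) estimates the size of the evaluation set.
errors = false_positives + false_negatives
implied_test_size = round(errors / (1 - accuracy))
print(errors, implied_test_size)
```

The 2225 errors imply an evaluation set of roughly 11,350 tweets, about one fifth of the dataset. That would fit a conventional 80/20 train/test split, though the split actually used is not given here.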

Machine Learning Algorithms comparison:

The neural network (Multilayer Perceptron) gave the highest accuracy score, 80.4%, but it is time-consuming to train. The Ridge Classifier model took less time to train and gave 79.2%. The MLP model is deployed in a Flask app to predict on new tweets and classify whether each one is positive or negative.

Deploying the Model in a Flask App


Conclusions

  • Using deep learning, by applying a Multilayer Perceptron (MLP) for prediction, gave the highest accuracy among the models: 80.4%.
  • Sentiment analysis is extremely useful in social media monitoring as it allows us to gain an overview of the wider public opinion behind certain topics, and monitor for toxic posts to warn and ban offending users.

Further Study

  • Consider emojis and how they could change the analysis.
  • Use CAMeL Tools modules (an Arabic-specific library) to prepare the text for the machine learning models.

Comments

  1. In theory, AraBERT (a Bidirectional Encoder Representations from Transformers based model for the Arabic language) should detect sentiment with more accuracy. When an off-the-shelf BERT model was used on English-language Amazon reviews without any additional training, it achieved 90% accuracy on data labeled positive and negative. After training on the other 3.6M Amazon reviews, it achieved an accuracy of 97% (a 70% decrease in error). https://github.com/cmiddleman/BERT_sentiment_analysis/
