407916/COMP723 Data Mining and Knowledge Engineering

Assignment 2 – Text Classification (50%)
1 Objective
To develop a broad understanding of text mining by performing a representative task, Text
Classification.
2 Task Specification
This assignment requires you to extend your data mining skills and knowledge from a
structured context to an unstructured context, where the items to be classified are “free text”
snippets. You are required to use Weka to train a range of classification algorithms on the
given dataset, analyse the results and present a report of your findings.
2.1.1 Due Dates and Submission
• This assignment is to be done in pairs. The report should clearly state the name and
student ID of both members of the team. Furthermore, the contributions made by each
team member must be clearly stated.
• The written part of your assignment is due on 30 October at 6pm.
• You are required to submit two copies of the assignment.
o A PDF copy via the Turnitin assignment submission tab (on the course
homepage) on AutOnline.
o A second copy via the normal submission tab (non-turnitin), submission tab on
2.1.2 Marking
• This assignment will be marked out of 100 marks and is worth 50% of the overall
mark for the paper.
• To pass this module you must pass each assessment separately, and gain at least 50%
in total. The minimum pass mark for this assignment is 40%.
3 Assignment Details
• Download the data file from AutOnline under the Assignment 2 folder. The corpus,
supplied as a single file, contains 5574 emails classified as SPAM (the positive class)
or non-SPAM (the negative class).
• Read through the readme file to understand the data.
• Either programmatically or otherwise, convert the data file into the following 2 arff files.
o The first file containing 66 percent of the instances, for training
o The second file containing 34 percent of the instances, for testing
• Ensure the proportion of positive to negative instances is the same in both of the
files above.
• Convert the arff files into Word Vectors by applying the StringToWordVector filter.
• Use NaiveBayes and 3 other classifiers of your choice from Weka to train a model and
validate the accuracy of the model on the testing dataset. Record these results and use
them as the benchmark for comparison with later runs.
• Now use feature engineering as discussed in lectures in an attempt to improve the
accuracy of your classification using the same 4 algorithms as above. Your feature
engineering can use one or more of the parameters given in the screen dump below.
• You should make feature property changes in a systematic way and record all your
results so that they can be presented in graphical form where applicable.
• Once you have settled on the optimum set of features, reconfigure the training dataset
and make it balanced with respect to the number of positive and negative class
instances. Now train the classifiers and record the results for the testing dataset.
• Now use 2 attribute selection methods to select a subset of attributes in an attempt to
improve the accuracy.
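The data preparation steps above (a 66/34 split with the same class proportions in both files, a balanced training set, and arff output) can be sketched in Python as follows. This is only an illustrative sketch, not a required method: it assumes the corpus has already been parsed into (label, text) pairs according to the readme, the class labels "spam" and "ham" and the relation name are placeholders, random undersampling is just one possible way of balancing, and the string escaping follows Weka's backslash convention as commonly described, so check the output by loading it in Weka.

```python
import random
from collections import defaultdict


def stratified_split(rows, train_fraction=0.66, seed=42):
    """Split (label, text) rows so each class keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def undersample(rows, seed=42):
    """Balance the classes by randomly keeping only as many rows of each
    class as the smallest class has (one possible balancing method)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


def to_arff(rows, relation="spam_corpus", labels=("spam", "ham")):
    """Render rows as arff text: one string attribute plus a nominal class.
    The relation name and label values are placeholders."""
    header = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    data = []
    for label, text in rows:
        # Backslash-escape characters that would break a quoted arff string.
        escaped = (text.replace("\\", "\\\\")
                       .replace("'", "\\'")
                       .replace("\n", "\\n")
                       .replace("\r", "\\r")
                       .replace("\t", "\\t"))
        data.append(f"'{escaped}',{label}")
    return "\n".join(header + data) + "\n"
```

Calling `stratified_split(rows)` gives the training and testing lists, `to_arff(...)` turns each into text that can be written to a file, and `undersample(train)` produces the balanced training set for the later step. The testing set is left untouched so that results from the balanced and imbalanced runs remain comparable.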
3.1.1 Written Report
• You will write a report of at least 6 and at most 12 pages (excluding the
references and appendix) describing the results of your experiment.
• You are required to write a coherent report describing all aspects of the experiment as
an attempt to get the best possible accuracy for the classification task. Any screen
shots or large result outputs that do not directly contribute to your argument should
be included in the appendix, rather than as part of the main report.
• You are not required to have a table of contents or executive summary for this report.
• There is no fixed format for the report. You can format it close to an academic paper
containing the usual sections such as Abstract, Introduction, Data Description,
Results, Discussion, Conclusion and a bibliography.
• As a minimum, your report should contain a discussion of the following points:
1. Presentation and discussion of the results obtained. You should use the correct
evaluation metrics in your discussion.
2. The rationale used for feature engineering decisions and tuning of any
parameter values for the classifiers used.
3. A discussion of the results for the imbalanced data, including the effect of
imbalance on classifiers and the methods used to deal with this imbalance.
4. A comparison of the test results from the classifiers trained on the balanced
dataset with those from the classifiers trained on the imbalanced dataset.
5. A brief discussion of applications of text classification.
6. The difference between using generic machine learning algorithms on
structured data, as you did for the first assignment, and using them on
unstructured text, as you did for this assignment.
7. A discussion of the similarities and differences of the machine learning
algorithms that you have used, as applicable to text classification.
8. A reflection on what you learnt from this assignment and what you would do
differently if you were to do the assignment again.
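For points 1 and 3 above, overall accuracy can be misleading when one class dominates, so measures computed for the positive (SPAM) class are often reported alongside it. The sketch below shows how precision, recall and F1 follow from the confusion-matrix counts. It is an illustration only: the function name and the "spam" label are placeholders, the choice of these particular metrics is an assumption rather than a statement of what the marking expects, and Weka reports comparable figures in its own evaluation output.

```python
def binary_metrics(actual, predicted, positive="spam"):
    """Compute accuracy, precision, recall and F1 for the positive class
    from two equal-length lists of class labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

As an example of the effect of imbalance, a classifier that labels every message non-SPAM on a test set of 10 SPAM and 90 non-SPAM messages scores an accuracy of 0.9 but a recall of 0.0 for the positive class.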
• The following approximate matrix will be used to grade your assignment.
Written Report
Formatting, Language and Presentation 10%
Discussion to demonstrate an understanding of the experimental tasks
Explanation of the rationale used for various
Discussion of the results 25%
Discussion to demonstrate an overall understanding of text classification