Mercurial > hg > machine-learning-hw6
annotate processEmail.m @ 1:e0f1290d2b43
Initial submission of work
author | Jordi Gutiérrez Hermoso <jordigh@octave.org> |
---|---|
date | Sun, 27 Nov 2011 23:18:00 -0500 |
parents | f602dc601e9e |
children | 7f92093ea77d |
function word_indices = processEmail(email_contents)
  ## PROCESSEMAIL preprocesses the body of an email and
  ## returns a list of word_indices
  ##   word_indices = PROCESSEMAIL(email_contents) preprocesses
  ##   the body of an email and returns a list of indices of the
  ##   words contained in the email.
  ##

  ## Load Vocabulary
  vocabList = getVocabList();

  ## Init return value
  word_indices = [];

  ## ========================== Preprocess Email ===========================

  ## Find the Headers ( \n\n and remove )
  ## Uncomment the following lines if you are working with raw emails with the
  ## full headers

  ## hdrstart = strfind(email_contents, ([char(10) char(10)]));
  ## email_contents = email_contents(hdrstart(1):end);

  ## Lower case
  email_contents = lower(email_contents);

  ## Strip all HTML
  ## Look for any expression that starts with < and ends with >, and does
  ## not have any < or > inside the tag, and replace it with a space
  email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

  ## Handle Numbers
  ## Look for one or more characters between 0-9
  email_contents = regexprep(email_contents, '[0-9]+', 'number');

  ## Handle URLS
  ## Look for strings starting with http:// or https://
  email_contents = regexprep(email_contents, ...
                             '(http|https)://[^\s]*', 'httpaddr');

  ## Handle Email Addresses
  ## Look for strings with @ in the middle
  email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

  ## Handle $ sign
  email_contents = regexprep(email_contents, '[$]+', 'dollar');


  ## ========================== Tokenize Email ===========================

  ## Output the email to screen as well
  fprintf('\n==== Processed Email ====\n\n');

  ## Process file
  l = 0;

  while ~isempty(email_contents)

    ## Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
        strtok(email_contents, ...
               [" @$/#.-:&*+=[]?!(){},'\">_<;%" char(10) char(13)]);

    ## Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    ## Stem the word
    ## (the porterStemmer sometimes has issues, so we use a try-catch block)
    try
      str = porterStemmer(strtrim(str));
    catch
      str = '';
      continue;
    end_try_catch

    ## Skip the word if it is too short
    if length(str) < 1
      continue;
    endif

    ## Look up the word in the vocabulary list and record its index
    for i = 1:numel (vocabList)
      if strcmp (vocabList{i}, str)
        word_indices(end+1) = i;
        break;
      endif
    endfor

    ## Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
      fprintf('\n');
      l = 0;
    endif
    fprintf("%s ", str);
    l = l + length(str) + 1;

  endwhile

  ## Print footer
  fprintf("\n\n=========================\n");

endfunction
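
The normalization and lookup steps above can be tried in isolation. The following is a minimal sketch, not part of the original submission: the sample string and the five-word toy vocabulary are made up for illustration, the tokenizer is simplified to a single regular expression, and stemming is skipped because `porterStemmer` is not defined in this file. The lookup uses `find (strcmp (...), 1)` as one possible alternative to the explicit inner loop; it is assumed to behave the same way for a vocabulary with unique entries.

```octave
## Hypothetical sample input (not taken from the exercise data).
sample = 'Visit <b>http://example.com</b> or mail bob@example.com to win $100!';

## Same normalization steps, in the same order, as processEmail.
s = lower (sample);
s = regexprep (s, '<[^<>]+>', ' ');                   # strip HTML tags
s = regexprep (s, '[0-9]+', 'number');                # digits -> "number"
s = regexprep (s, '(http|https)://[^\s]*', 'httpaddr');
s = regexprep (s, '[^\s]+@[^\s]+', 'emailaddr');
s = regexprep (s, '[$]+', 'dollar');

## Simplified tokenizer (assumption): collapse every run of characters
## other than a-z and 0-9 into one space, then split on spaces.
tokens = strsplit (strtrim (regexprep (s, '[^a-z0-9]+', ' ')), ' ');

## Hypothetical toy vocabulary standing in for getVocabList().
vocab = {'emailaddr', 'httpaddr', 'mail', 'visit', 'win'};

## Look up each token; find (..., 1) returns the first match or [].
word_indices = [];
for k = 1:numel (tokens)
  idx = find (strcmp (vocab, tokens{k}), 1);
  if ~isempty (idx)
    word_indices(end+1) = idx;
  endif
endfor

disp (word_indices);   # prints 4 2 3 1 5 for the inputs above
```

Tokens that are not in the vocabulary, such as `or`, `to`, and the fused `dollarnumber`, are skipped, just as unmatched words leave `word_indices` unchanged in the full function.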