Mercurial > hg > machine-learning-hw6
annotate processEmail.m @ 3:ace890ed0ed9 default tip
Use lookup to look for all words at once
author | Jordi Gutiérrez Hermoso <jordigh@octave.org> |
---|---|
date | Sat, 10 Dec 2011 15:56:02 -0500 |
parents | 7f92093ea77d |
children |
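
The changeset description refers to Octave's `lookup` function. As a minimal sketch of what looking up all words at once means, the snippet below matches a cell array of words against a small vocabulary in a single call. The vocabulary and word list are invented for illustration, and the sketch assumes the vocabulary is sorted lexicographically, which `lookup` requires of its table argument.

```octave
## Hypothetical, lexicographically sorted vocabulary (not the real vocab list).
vocab = {"anyon", "emailaddr", "httpaddr", "number", "visit"};

## Hypothetical tokens taken from an email; "xyzzy" is not in the vocabulary.
words = {"visit", "httpaddr", "xyzzy", "number"};

## With the "m" option, lookup returns the index of each exact match
## and 0 for every word that does not occur in the table.
idx = lookup (vocab, words, "m");   # [5 3 0 4]

## Drop the zeros so that only indices of known words remain.
idx(idx == 0) = [];                 # [5 3 4]
disp (idx);
```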

rev | changeset | description | author | parents |
---|---|---|---|---|
1 | e0f1290d2b43 | Initial submission of work | Jordi Gutiérrez Hermoso <jordigh@octave.org> | 0 |
3 | ace890ed0ed9 | Use lookup to look for all words at once | Jordi Gutiérrez Hermoso <jordigh@octave.org> | 2 |

rev | source |
---|---|
0 | function word_indices = processEmail(email_contents) |
1 | ##PROCESSEMAIL preprocesses the body of an email and |
1 | ##returns a list of word_indices |
1 | ## word_indices = PROCESSEMAIL(email_contents) preprocesses |
1 | ## the body of an email and returns a list of indices of the |
1 | ## words contained in the email. |
1 | ## |
0 | |
1 | ## Load Vocabulary |
1 | vocabList = getVocabList(); |
0 | |
1 | ## Init return value |
1 | word_indices = []; |
0 | |
1 | ## ========================== Preprocess Email =========================== |
0 | |
1 | ## Find the end of the headers (the first \n\n) and remove them |
1 | ## Uncomment the following lines if you are working with raw emails with the |
1 | ## full headers |
0 | |
1 | ## hdrstart = strfind(email_contents, ([char(10) char(10)])); |
1 | ## email_contents = email_contents(hdrstart(1):end); |
0 | |
1 | ## Lower case |
1 | email_contents = lower(email_contents); |
0 | |
1 | ## Strip all HTML |
1 | ## Look for any expression that starts with < and ends with > |
1 | ## and does not have any < or > inside the tag, and replace it with a space |
1 | email_contents = regexprep(email_contents, '<[^<>]+>', ' '); |
0 | |
1 | ## Handle Numbers |
1 | ## Look for one or more characters between 0-9 |
1 | email_contents = regexprep(email_contents, '[0-9]+', 'number'); |
0 | |
1 | ## Handle URLs |
1 | ## Look for strings starting with http:// or https:// |
1 | email_contents = regexprep(email_contents, ... |
1 | '(http|https)://[^\s]*', 'httpaddr'); |
0 | |
1 | ## Handle Email Addresses |
1 | ## Look for strings with @ in the middle |
1 | email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr'); |
0 | |
1 | ## Handle $ sign |
1 | email_contents = regexprep(email_contents, '[$]+', 'dollar'); |
0 | |
0 | |
1 | ## ========================== Tokenize Email =========================== |
0 | |
1 | ## Output the email to screen as well |
1 | fprintf('\n==== Processed Email ====\n\n'); |
0 | |
1 | ## Process file |
1 | l = 0; |
0 | |
3 | str_words = {}; |
3 | |
1 | while ~isempty(email_contents) |
0 | |
1 | ## Tokenize and also get rid of any punctuation |
1 | [str, email_contents] = ... |
1 | strtok(email_contents, ... |
1 | [" @$/#.-:&*+=[]?!(){},'\">_<;%" char(10) char(13)]); |
1 | |
1 | ## Remove any non alphanumeric characters |
0 | str = regexprep(str, '[^a-zA-Z0-9]', ''); |
0 | |
1 | ## Stem the word |
1 | ## (the porterStemmer sometimes has issues, so we use a try catch block) |
0 | try str = porterStemmer(strtrim(str)); |
0 | catch str = ''; continue; |
1 | end_try_catch; |
0 | |
1 | ## Skip the word if it is too short |
0 | if length(str) < 1 |
1 | continue; |
1 | endif |
3 | |
3 | ## Store the words |
3 | str_words{end+1} = str; |
0 | |
1 | ## Print to screen, ensuring that the output lines are not too long |
1 | if (l + length(str) + 1) > 78 |
1 | fprintf('\n'); |
1 | l = 0; |
1 | endif |
1 | fprintf("%s ", str); |
1 | l = l + length(str) + 1; |
1 | |
1 | endwhile |
0 | |
3 | word_indices = lookup (vocabList, str_words, "m"); |
3 | word_indices (word_indices == 0) = []; |
3 | |
1 | ## Print footer |
1 | fprintf("\n\n=========================\n"); |
1 | |
1 | endfunction |
1 | |
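
To make the preprocessing and tokenizing steps in the listing concrete, here is a small sketch that applies the same regular expressions and the same `strtok` delimiter set to a single made-up string. The sample text is an assumption chosen for illustration, and the stemming step is left out because `porterStemmer` is a separate file that is not shown on this page.

```octave
## Hypothetical sample text; not taken from the exercise data.
s = "Visit <b>http://example.com</b> or mail bob@example.com for $100";

## Normalize in the same order as the listing.
s = lower (s);
s = regexprep (s, '<[^<>]+>', ' ');                      # strip HTML tags
s = regexprep (s, '[0-9]+', 'number');                   # digits -> "number"
s = regexprep (s, '(http|https)://[^\s]*', 'httpaddr');  # URLs
s = regexprep (s, '[^\s]+@[^\s]+', 'emailaddr');         # email addresses
s = regexprep (s, '[$]+', 'dollar');                     # dollar signs

## Tokenize on the same delimiter set and keep the non-empty tokens.
delims = [" @$/#.-:&*+=[]?!(){},'\">_<;%" char(10) char(13)];
words = {};
rest = s;
while ~isempty (rest)
  [tok, rest] = strtok (rest, delims);
  tok = regexprep (tok, '[^a-zA-Z0-9]', '');
  if isempty (tok)
    continue;
  endif
  words{end+1} = tok;
endwhile

## words is {"visit", "httpaddr", "or", "mail", "emailaddr", "for", "dollarnumber"}
disp (words);
```

Because the dollar sign is rewritten before tokenizing, `$100` survives as the single token `dollarnumber` instead of being split at the `$` delimiter.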