annotate processEmail.m @ 1:e0f1290d2b43

Initial submission of work
author Jordi Gutiérrez Hermoso <jordigh@octave.org>
date Sun, 27 Nov 2011 23:18:00 -0500
parents f602dc601e9e
children 7f92093ea77d
function word_indices = processEmail(email_contents)
  ## PROCESSEMAIL preprocesses the body of an email and
  ## returns a list of word_indices
  ##   word_indices = PROCESSEMAIL(email_contents) preprocesses
  ##   the body of an email and returns a list of indices of the
  ##   words contained in the email.
  ##

  ## Load Vocabulary
  vocabList = getVocabList();

  ## Init return value
  word_indices = [];

  ## ========================== Preprocess Email ===========================

  ## Find the headers (terminated by \n\n) and remove them.
  ## Uncomment the following lines if you are working with raw emails with the
  ## full headers

  ## hdrstart = strfind(email_contents, ([char(10) char(10)]));
  ## email_contents = email_contents(hdrstart(1):end);

  ## Lower case
  email_contents = lower(email_contents);

  ## Strip all HTML
  ## Look for any expression that starts with < and ends with >, and that
  ## does not contain < or > inside the tag, and replace it with a space
  email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

  ## Handle Numbers
  ## Look for one or more characters between 0-9
  email_contents = regexprep(email_contents, '[0-9]+', 'number');

  ## Handle URLs
  ## Look for strings starting with http:// or https://
  email_contents = regexprep(email_contents, ...
                             '(http|https)://[^\s]*', 'httpaddr');

  ## Handle Email Addresses
  ## Look for strings with @ in the middle
  email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

  ## Handle $ sign
  email_contents = regexprep(email_contents, '[$]+', 'dollar');


  ## ========================== Tokenize Email ===========================

  ## Output the email to screen as well
  fprintf('\n==== Processed Email ====\n\n');

  ## Process file
  ## l tracks the length of the current output line
  l = 0;

  while ~isempty(email_contents)

    ## Tokenize and also get rid of any punctuation
    ## (the ... continuation marker is used because newer Octave releases
    ## no longer accept a trailing backslash outside of strings)
    [str, email_contents] = ...
      strtok(email_contents, ...
             [" @$/#.-:&*+=[]?!(){},'\">_<;%" char(10) char(13)]);

    ## Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    ## Stem the word
    ## (the porterStemmer sometimes has issues, so we use a try catch block)
    try
      str = porterStemmer(strtrim(str));
    catch
      str = '';
      continue;
    end_try_catch

    ## Skip the word if it is empty after cleaning and stemming
    if length(str) < 1
      continue;
    endif

    ## Look up the word in the vocabulary list and record its index
    for i = 1:numel (vocabList)
      if strcmp (vocabList{i}, str)
        word_indices(end+1) = i;
        break;
      endif
    endfor

    ## Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
      fprintf('\n');
      l = 0;
    endif
    fprintf("%s ", str);
    l = l + length(str) + 1;

  endwhile

  ## Print footer
  fprintf("\n\n=========================\n");

endfunction
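
## A minimal sketch, not part of the original submission: the test block
## below replays the normalization steps from processEmail on a made-up
## sample string, so it runs without getVocabList or porterStemmer.  The
## sample text and the final whitespace collapse are assumptions added for
## illustration only.  It can be run with `test processEmail`.
%!test
%! s = lower ('Visit <b>http://example.com</b> or mail me@example.com for $100');
%! s = regexprep (s, '<[^<>]+>', ' ');                    # strip HTML tags
%! s = regexprep (s, '[0-9]+', 'number');                 # digits -> "number"
%! s = regexprep (s, '(http|https)://[^\s]*', 'httpaddr');
%! s = regexprep (s, '[^\s]+@[^\s]+', 'emailaddr');
%! s = regexprep (s, '[$]+', 'dollar');
%! s = regexprep (s, '\s+', ' ');                         # test-only: collapse whitespace
%! assert (s, 'visit httpaddr or mail emailaddr for dollarnumber');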
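
## A possible alternative to the vocabulary lookup loop, sketched under
## assumptions rather than taken from the original code: strcmp accepts a
## cell array and returns a logical mask, so find (..., 1) gives the first
## matching index, or an empty result when the word is absent.  The
## three-word vocabulary is hypothetical and stands in for the output of
## getVocabList.
%!test
%! vocab = {'anyon', 'know', 'how'};
%! assert (find (strcmp (vocab, 'know'), 1), 2);
%! assert (isempty (find (strcmp (vocab, 'missing'), 1)));
%! word_indices = [];
%! word_indices = [word_indices, find(strcmp (vocab, 'how'), 1)];
%! word_indices = [word_indices, find(strcmp (vocab, 'missing'), 1)];
%! assert (word_indices, 3);   # the absent word contributes nothing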