<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:scalar="http://scalar.usc.edu/2012/01/scalar-ns#" xmlns:prov="http://www.w3.org/ns/prov#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:ov="http://open.vocab.org/terms/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:oac="http://www.openannotation.org/ns/" xmlns:art="http://simile.mit.edu/2003/10/ontologies/artstor#">
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/advanced-options">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-16T10:37:11+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:316238"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/advanced-options.1"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/advanced-options.1"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/advanced-options.1">
<ov:versionnumber>1</ov:versionnumber>
<dcterms:title>Advanced Options</dcterms:title>
<dcterms:description>Manual page for the Lexos Tokenize and Analyze Advanced Options</dcterms:description>
<sioc:content><h4><u>Tokenize</u></h4>By default Lexos splits strings of text into tokens every time it encounters a space character. For Western languages, this means that each token generally corresponds to a word. Click the <strong>by Characters</strong> radio button to treat every character as a separate token. If you wish to use n-grams, increase the <strong>1-gram</strong> incrementer to 2, 3, 4, etc. For example, &quot;the dog ran&quot; would produce the 1-gram tokens <em>the</em>, <em>dog</em>, <em>ran</em>; the 2-grams <em>the dog</em>, <em>dog ran</em>; and so on. 2-grams tokenized by characters would begin <em>th</em>, <em>he</em>, <em>e&nbsp;</em>, and so on.<br /><br />Note that increasing the n-gram size may produce a larger DTM, and the table will thus take longer to build.<h4><u>Culling Options</u></h4>&quot;Culling Options&quot; is a generic term we use for methods of decreasing the number of terms used to generate the DTM based on statistical criteria (as opposed to something like applying a stopword list in <strong>Scrubber</strong>). Lexos offers three different methods:<br /><br />1. <strong>Most Frequent Words</strong>: This method takes a slice of the DTM containing only the top N most frequently occurring terms. The default setting is 100.<br />2. <strong>Culling</strong>: This method builds the DTM using only terms that occur in at least N documents. The default setting is 1.<br />3. <strong>Greywords</strong>: This method removes from the DTM those terms occurring at particularly low frequencies. Lexos calculates the cut-off point based on the average length of your documents.<h4><u>Normalize</u></h4>By default, Lexos displays the frequency of the occurrence of terms in your documents as a proportion of the entire text. If you wish to see the actual number of occurrences, click the <strong>Raw Counts</strong> radio button. You may also attempt to take into account differences in the lengths of your documents by calculating their <a target="_blank" href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">Term Frequency-Inverse Document Frequency (TF-IDF)</a>. Lexos offers three different methods of calculating TF-IDF based on <strong>Euclidean Distance</strong>, <strong>Manhattan Distance</strong>, or without using a distance metric (<strong>Norm: None</strong>). For further discussion of these options, see the topics article on <a href="http://scalar.usc.edu/works/lexos/tf-idf">TF-IDF</a>.<h4><u>Assign Temporary Labels</u></h4>Lexos automatically uses the label in the &quot;Document Name&quot; column in the <strong>Manage</strong> tool as the document label. However, you may change the label used in your table by entering a new value for it in the forms displayed in <strong>Assign Temporary Labels</strong>. This is particularly useful if you want to save different labels when you download your DTM. Keep in mind that whatever labels you set will be applied in all other Lexos tools that use the Advanced Options. However, the original document name in <strong>Manage</strong> will not be affected. After assigning temporary labels in <strong>Tokenizer</strong>, click the <strong>Regenerate Table</strong> button to rebuild the table with the new labels.</sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-16T10:37:11+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:834672"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/advanced-options"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
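<!--
Illustrative sketch for the Advanced Options page above, assuming scikit-learn is available (this is not the Lexos implementation; the toy documents are made up): word n-grams, culling by document count or top N terms, and TF-IDF with L2 (Euclidean) normalization.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the dog ran", "the dog sat", "the cat sat still"]

# Word 2-grams: "the dog ran" yields the tokens "the dog" and "dog ran".
bigrams = CountVectorizer(analyzer="word", ngram_range=(2, 2))
print(bigrams.fit_transform(docs).toarray())
print(bigrams.get_feature_names_out())

# Culling: keep only terms that occur in at least 2 documents (min_df),
# capped at the 100 most frequent terms (max_features).
culled = CountVectorizer(min_df=2, max_features=100)
print(culled.fit_transform(docs).toarray())

# TF-IDF, normalizing each document vector by its Euclidean (L2) length.
tfidf = TfidfVectorizer(norm="l2")
print(tfidf.fit_transform(docs).toarray())
-->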
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/ajaxtest">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<scalar:customScript>$(document).ready(function(){
  // Fetch the raw Markdown source of the Windows install guide from GitHub;
  // jQuery's .load() passes the response text to the callback.
  $("#github-content").load("https://raw.githubusercontent.com/WheatonCS/Lexos/master/0_InstallGuides/Windows/README.md", function(markdown){
    // Convert the Markdown to HTML with Remarkable's CommonMark preset.
    var md = new Remarkable('commonmark');
    var html = md.render(markdown);
    // Replace the raw Markdown with the rendered HTML and reveal the element.
    $("#github-content").html(html).show();
  });
});</scalar:customScript>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-07-08T14:33:47+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:477723"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/ajaxtest.7"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/ajaxtest.7"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/ajaxtest.7">
<ov:versionnumber>7</ov:versionnumber>
<dcterms:title>Content from GitHub</dcterms:title>
<sioc:content><p>Material below the horizontal rule is fetched from the Lexos GitHub repository using Ajax. This avoids Scalar&#39;s clunky editor (you can edit offline) and provides better version control. The downside is that you don&#39;t get Scalar&#39;s embedding markup (I&#39;m not sure how necessary it actually is). Although I&#39;ve used Ajax, we could also grab the material from GitHub&#39;s API, but this might actually be more complicated.</p><p>Since the original document on GitHub is in Markdown, I have had to use a script to convert it to HTML. I&#39;ve used <a target="_blank" href="https://github.com/jonschlinkert/remarkable">Remarkable</a>, which seems to work pretty well.</p><hr /><script src="https://cdn.jsdelivr.net/remarkable/1.7.1/remarkable.min.js"></script><div id="github-content" style="display:none;">&nbsp;</div></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-07-08T15:44:50+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1257687"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/ajaxtest"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/bibliography">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-09T06:58:09+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:160478"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/bibliography.8"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/bibliography.8"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/bibliography.8">
<ov:versionnumber>8</ov:versionnumber>
<dcterms:title>Bibliography</dcterms:title>
<dcterms:description>Beginning of bibliography path</dcterms:description>
<sioc:content><p>We are working on our bibliography. In the meantime, check out the Zotero bibliographies listed below.</p><!--<iframe src="http://bibbase.org/show?bib=https%3A%2F%2Fapi.zotero.org%2Fgroups%2F47671%2Fitems%3Fkey%3DrjMBGMVIUHoqwuGZRNI4HppO%26format%3Dbibtex%26limit%3D100" style="min-height:600px;" width="100%"></iframe>--></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-04-04T14:59:08+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1105242"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/bibliography"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1105242:439018:1">
<scalar:urn rdf:resource="urn:scalar:path:1105242:439018:1"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/bibliography.8"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/dariah-bibliography.6#index=1"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/dariah-bibliography">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-29T17:40:26+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:176729"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/dariah-bibliography.6"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/dariah-bibliography.6"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/dariah-bibliography.6">
<ov:versionnumber>6</ov:versionnumber>
<dcterms:title>DARIAH Bibliography</dcterms:title>
<dcterms:description>Zotero bibliography for Digital Humanities maintained by the DARIAH collaborative</dcterms:description>
<sioc:content><iframe src="https://www.zotero.org/groups/doing_digital_humanities_-_a_dariah_bibliography/items" style="height:800px;width:100%;border:1px solid #000;"></iframe></sioc:content>
<scalar:defaultView>blank</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-29T17:47:52+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:439018"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/dariah-bibliography"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1105242:1105215:2">
<scalar:urn rdf:resource="urn:scalar:path:1105242:1105215:2"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/bibliography.8"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/stylometry-bibliography.1#index=2"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/stylometry-bibliography">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-04-04T14:56:43+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:416203"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/stylometry-bibliography.1"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/stylometry-bibliography.1"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/stylometry-bibliography.1">
<ov:versionnumber>1</ov:versionnumber>
<dcterms:title>Stylometry Bibliography</dcterms:title>
<dcterms:description>Stylometry bibliography generated from DH 2016 conference</dcterms:description>
<sioc:content><iframe src="https://www.zotero.org/groups/stylometry_bibliography/items/order/year/sort/desc" style="height:800px;width:100%;border:1px solid #000;"></iframe></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-04-04T14:56:43+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1105215"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/stylometry-bibliography"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/bubbleviz">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-04T10:08:28+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:159681"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/bubbleviz.4"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/bubbleviz.4"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/bubbleviz.4">
<ov:versionnumber>4</ov:versionnumber>
<dcterms:title>The BubbleViz Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos BubbleViz tool</dcterms:description>
<sioc:content><strong>BubbleViz</strong> offers an alternative to word clouds as a method of visualizing the <strong>Document-Term Matrix</strong>. It presents terms arranged inside circles (&quot;bubbles&quot;) sized according to the terms&#39; frequency within the text. <strong>BubbleViz</strong> graphs enable you to get a sense of the content in your corpus, and they are very good for presentations. To generate a <strong>BubbleViz</strong>, select some or all of your active documents using the <strong>Select Document(s)</strong> check boxes. The Lexos <strong>BubbleViz</strong> tool allows you to control the <strong>Graph Size</strong> (in pixels) and to filter the <strong>Maximum Number of Terms</strong>. You can also set a <strong>Minimum Term Length</strong>: the minimum number of characters required in a term for it to be added to the graph. Once you have chosen your options, click the <strong>Get Graph</strong> button to generate the graph. If you then wish to download the graph as a PNG file, click the <strong>Save as PNG</strong> button.<br /><br />Run your mouse cursor over a bubble to open a tooltip showing the number of times the term occurs in your selected document(s).</sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<scalar:continue_to_content_id>159681</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-16T20:44:10+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:834855"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/bubbleviz"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
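<!--
A minimal sketch of the idea behind BubbleViz, assuming matplotlib and made-up term counts (this is not the Lexos implementation): each term is drawn as a circle whose area grows with its frequency.

import matplotlib.pyplot as plt

# Hypothetical term counts for a selected document.
terms = {"whale": 120, "sea": 95, "ship": 60, "captain": 40}
x = range(len(terms))
sizes = [10 * count for count in terms.values()]  # marker area tracks frequency

plt.scatter(x, [1] * len(terms), s=sizes, alpha=0.5)
for xi, term in zip(x, terms):
    plt.annotate(term, (xi, 1), ha="center")
plt.axis("off")
plt.show()
-->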
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/choosing-a-distance-metric">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-12T00:20:24+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:172451"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/choosing-a-distance-metric.5"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/choosing-a-distance-metric.5"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/choosing-a-distance-metric.5">
<ov:versionnumber>5</ov:versionnumber>
<dcterms:title>Choosing a Distance Metric</dcterms:title>
<dcterms:description>More detailed discussion of distance metrics</dcterms:description>
<sioc:content>In hierarchical clustering, a distance metric must be chosen before running the algorithm for merging documents into clusters. K-means clustering uses standard Euclidean distance to determine the distance from the cluster centroid, but this or other distance measures can be used to evaluate the cluster quality. (In Lexos, this is done through the Silhouette Score, which can be calculated using multiple distance metrics.)
A few general observations have already been made under <a href="http://scalar.usc.edu/works/lexos/cluster-analysis" data-display-content-preview-box="true">Cluster Analysis</a>. The distance metric is essentially how you define the difference between your documents. The Euclidean distance metric measures the magnitude of the difference in distance between two document vectors (vectors of counts for each word in both documents). Non-Euclidean metrics like <a href="http://scalar.usc.edu/works/lexos/glossary#cosine-similarity" data-display-content-preview-box="true">cosine similarity</a>, which measures the angle between the vectors, can also be converted into measures of distance between clusters. Since document-term matrices are often sparse (they contain many term counts of 0), cosine similarity may be a better option for clustering larger documents, particularly if the documents are of uneven lengths. But the emphasis must be placed on <i>may</i>. There are no hard and fast rules, although there is renewed attention to providing more nuanced help with the choice of metrics (Jannidis <i>et al.</i>&nbsp;2015, Eder 201?).
The circumstances under which certain distance metrics perform best, or even how to use machine learning to aid in the selection of such metrics, are the subject of ongoing research. However, much of this research uses data very different from the type of material used in literary text analysis. Currently, our best advice is to be aware of how you are measuring distance, to experiment with different distance and linkage metrics, and to try to explain how they operate on your texts. We provide a case study here which introduces some of the most common metrics (all available in Lexos) and shows how they affect the results of a single data set.<div>
</div><div>Eder, M. (201?). Visualization in stylometry: some problems and solutions. To be published in <i>Digital Scholarship in the Humanities</i>.</div><div>
</div><div>Jannidis, F., Pielström, S., Schöch, C. and Vitt, T. (2015). Improving Burrows&#39; Delta -- An empirical evaluation of text distance measures. Presented at DH 2015 Global Digital Humanities, Sydney, Australia, July 3, 2015.</div></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3689"/>
<dcterms:created>2015-08-13T15:45:22+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:431807"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/choosing-a-distance-metric"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
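<!--
A minimal sketch of the contrast drawn on the Choosing a Distance Metric page above, assuming scipy is available (the toy vectors are illustrative): cosine distance ignores differences in document length, while Euclidean distance does not.

from scipy.spatial.distance import cosine, euclidean

# Term counts over the same vocabulary; doc_b is, in effect, doc_a
# repeated three times (a longer document with the same proportions).
doc_a = [5, 4, 0, 1]
doc_b = [15, 12, 0, 3]

print(euclidean(doc_a, doc_b))  # about 12.96: raw counts differ in magnitude
print(cosine(doc_a, doc_b))     # 0.0: the vectors point in the same direction
-->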
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/cluster-analysis">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-07-31T23:01:32+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:170618"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/cluster-analysis.35"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/cluster-analysis.35"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/cluster-analysis.35">
<ov:versionnumber>35</ov:versionnumber>
<dcterms:title>Cluster Analysis</dcterms:title>
<dcterms:description>The start page for the cluster analysis topics path</dcterms:description>
<sioc:content><em>In order to use cluster analysis successfully to interpret literary texts, it is important to have a good understanding of how the process works.</em><p>Cluster analysis may be formally defined as an <a href="http://scalar.usc.edu/works/lexos/glossary#unsupervised-learning">unsupervised learning</a> technique for finding &ldquo;natural&rdquo; groupings of given instances in unlabeled data. For the purposes of text analysis, a clustering method is a procedure that starts with a group of documents and attempts to organize them into relatively homogeneous groups called &ldquo;clusters&rdquo;. Clustering methods differ from classification methods (<a href="http://scalar.usc.edu/works/lexos/glossary#supervised-learning">supervised learning</a>) in that clustering attempts to form these groups entirely from the data, rather than by assigning documents to predefined groups with designated class labels.</p><p>Cluster analysis works by counting the frequency of terms occurring in documents and then grouping the documents based on similarities in these frequencies. When we observe that an author or text uses a term or a group of terms more than other authors or other texts do, we are innately using this technique. In making such claims, we rely on our memory, as well as sometimes unstated selection processes. It may be true, for instance, that Text A uses more terms relating to violence than Text B, but in fact, the difference between the two may be proportionally much less than the difference in the frequency of other terms such as &ldquo;the&rdquo; and &ldquo;is&rdquo; on which we do not traditionally base our interpretation. Cluster analysis leverages the ability of the computer to compare many more terms than is possible (or at least practical) for the human mind. It may therefore reveal patterns that can be missed by traditional forms of reading. Cluster analysis can be very useful for exploring similarities between texts or segments of texts and can also be used as a test for hypotheses you may have about your texts. But it is important to remember that the type of clustering discussed here relies on the frequency of terms, not their semantic qualities. As a result, it can only provide a kind of proxy for meaning. In order to use cluster analysis successfully to interpret literary texts, it is important to have a good understanding of how the process works.</p><p>Here, we acknowledge the many levels of expertise required to fully appreciate cluster analysis in general, and specifically when choosing (i) metrics that define a distance between documents, (ii) metrics that manage the clustering of like documents, and (iii) clustering based on different features of the text (e.g. all or only the most frequent terms) (Eder, &quot;Computational Stylistics and Biblical Translation&quot;). But in all the details, we encourage the reader to seek the benefits of exploratory analysis like cluster analysis early and often, even as you are learning more about the statistical roots of different metrics. It always helps if you have a local statistician to consult.</p><h2><strong>Document Similarity</strong></h2><p>In traditional literary criticism, concepts like &ldquo;genre&rdquo; are frequently used to group texts. There may be some taxonomic criteria for assigning texts to these groups, but recent work in Digital Humanities (Moretti, Jockers, Underwood) has highlighted that such categories can be usefully re-examined using quantitative methods. 
In cluster analysis, the basis for dividing or assigning documents into clusters is a statistical calculation of their (dis)similarity. Similar documents are ones in which terms are observed to occur with considerably homogeneous frequencies. Documents are dissimilar when term frequencies are more heterogeneous.</p><p>A clearer picture of this definition of similarity emerges if we examine how it is measured. Imagine three documents as points within a coordinate space.</p><p><strong><a class="inline" resource="media/cluster-analysis-chart" href="media/ClusterChart1.PNG"></a></strong></p><p>Document A can be imagined to be more similar to Document B than to Document C by using the <a href="http://scalar.usc.edu/works/lexos/glossary#distance-metric">distance</a> between them as a metric. Simply draw a straight line between the points, and the documents with the shortest line between them are the most similar. When using cluster analysis for the study of texts, we take as our premise the idea that this notion of similarity measured as proximity may correlate to a range of historical and stylistic relationships amongst the documents in our corpus.</p><p>The graph above represents the end of the process. In order to plot our documents in coordinate space, we must first determine the coordinates. This is done by counting the terms in our documents to produce a <a data-mce-href="http://scalar.usc.edu/works/lexos/glossary#document-term-matrix" href="http://scalar.usc.edu/works/lexos/glossary#document-term-matrix">document-term matrix</a>. For instance, part of the document-term matrix that produced the graph above might be:</p><table border="1" width="100%"><tbody><tr><th>&nbsp;</th><th>man</th><th>woman</th></tr><tr><td>Document A</td><td>5</td><td>4</td></tr><tr><td>Document B</td><td>4</td><td>5</td></tr><tr><td>Document C</td><td>1</td><td>3</td></tr></tbody></table><p>The list of term counts for each document is called a <a data-mce-href="http://scalar.usc.edu/works/lexos/glossary#document-vector" href="http://scalar.usc.edu/works/lexos/glossary#document-vector">document vector</a>. Representing the text as a vector of term counts allows us to calculate the distance or dissimilarity between documents. We can easily convert the document-term matrix into a &quot;distance matrix&quot; (also called a &quot;dissimilarity matrix&quot;) by taking the difference between the term counts for each document vector.</p><table border="1" width="100%"><tbody><tr><th>man</th><th>Document A</th><th>Document B</th><th>Document C</th></tr><tr><td>Document A</td><td>-</td><td>1</td><td>4</td></tr><tr><td>Document B</td><td>1</td><td>-</td><td>3</td></tr><tr><td>Document C</td><td>4</td><td>3</td><td>-</td></tr></tbody></table><p>The distance from A to B is 1 while the distance from B to C is 3. Documents A and B form a cluster because the distance between them is shorter than the distance between either of them and Document C.</p><p>Notice that the table above reproduces only the portion of the document vectors representing the frequency of the word &ldquo;man&rdquo;. Adding the &ldquo;woman&rdquo; portion creates considerable difficulties for us if we are trying to represent the data in rows and columns. That is because each term in the document vector represents a separate dimension of that vector. The full text of a document may be represented by a vector with thousands of dimensions. 
Imagine a spreadsheet with thousands of individual sheets, one for each term in the document, and you get the idea. In order for the human mind to interpret this data, we need to produce a flattening, or <a data-display-content-preview-box="true" href="glossary#dimensionality-reduction">dimensionality reduction</a>, of the whole distance matrix. The computer does this by algorithmically going through the distance matrix and adjusting the distance between each document vector based on the <em>observed distances</em> between each of the terms. There are, in fact, different algorithms for doing this, and part of a successful use of cluster analysis involves choosing the algorithm best suited for the materials being examined.</p><p>Many discussions of this process begin with the notion of <a data-display-content-preview-box="true" href="glossary#feature-selection">feature selection</a>. In text analysis, this equates to determining what features of the text make up the document vector. The procedure for feature selection is essentially the process of <a data-display-content-preview-box="true" href="glossary#scrubbing">scrubbing</a>, <a data-display-content-preview-box="true" href="glossary#cutting">cutting</a>, and <a data-display-content-preview-box="true" href="glossary#tokenization">tokenization</a>. You may also perform certain <a data-display-content-preview-box="true" href="glossary#normalization">normalization</a> measures that modify the term frequencies in order to account for differences in document length. Depending on how you perform these tasks, the results of cluster analysis can be very different.<br /><br />One of the factors that will influence your results is your choice of <a data-display-content-preview-box="true" href="glossary#distance-metric">distance metric</a>. The distance metric is essentially how you define the difference between your documents. A simple example using a distance metric might be the words <em>cat</em> and <em>can</em> (think of these words as documents composed of vectors of three letters each). A distance metric called <a data-display-content-preview-box="true" href="glossary#edit-distance">edit distance</a> can be defined as the number of character changes required to transform <em>cat</em> into <em>can</em> (here, just one change = 1). The difference between <em>cat</em> and <em>call</em> would be 2. Edit distances are very good for measuring the distance between short strings of characters like individual words, but they are unwieldy for longer document vectors. So for whole texts we need some other way of defining distance.&nbsp;</p><p>The <a data-display-content-preview-box="true" href="http://scalar.usc.edu/works/lexos/glossary#euclidean-distance">Euclidean distance</a> metric measures the magnitude of the difference in distance between two document vectors. The Euclidean distance is essentially the length of a line between a point representing a term on one vector and a point representing the same term on the other. You might then decide to take the average Euclidean distance for all points and treat that as the measure of document distance. Statisticians have developed a number of metrics based on modifications of Euclidean distance that can serve as alternative ways of defining document distance. Another approach is to measure the <a data-display-content-preview-box="true" href="glossary#similarity">similarity</a> of the two vectors. 
It may help the reader to visualize two documents with N unique words as represented by two vectors sticking out into N-space. Given two documents, a common method of computing distance is to calculate the <a data-display-content-preview-box="true" href="glossary#cosine-similarity">cosine similarity</a>, which is based on the angle between the two document vectors. Cosine similarity is not truly a measure of distance, but it can be converted into a distance measure by subtracting the cosine similarity value (which varies between 0 and 1) from 1.</p><p>The difference between Euclidean and non-Euclidean metrics for measuring document distance is largely one of perspective. Euclidean distances tend to measure the distance between documents at some point along the vector where the documents are already quite distinct from one another. In longer documents, this distinction may correlate to a lot of terms which are found in one document but not the other. In the document-term matrix, counts for these terms in the documents where they do not occur are recorded as 0. A matrix in which most of the elements are zero is called a <a data-display-content-preview-box="true" href="glossary#sparse-matrix">sparse matrix</a>. The more dimensions there are in the document vectors, the more likely it is that the document-term matrix will be sparse. Many <i>hapax legomena</i> (terms occurring only once) in your documents, for instance, will very likely produce a sparse document-term matrix.&nbsp;</p><p>Measuring Euclidean distance in a sparse matrix can affect the way clustering takes place. We would certainly expect this to be the case if one document was much longer than the other. Using cosine similarity is one way to address this problem since the angle between document vectors does not change depending on the location of points along the vector. There are many variants on these basic Euclidean and non-Euclidean approaches to defining the distance metric for clustering. For further discussion, see <a data-display-content-preview-box="true" href="choosing-a-distance-metric">Choosing a Distance Metric</a>.</p><p>Other factors influencing your results will depend on the type of cluster analysis you use, your choice of linkage method, the choice of token type, and the size of the sample of terms considered (e.g., the whole text, as opposed to the top 100 most frequent terms).</p><h2>Types of Cluster Analysis</h2><p>There are two main approaches to clustering: <a data-display-content-preview-box="true" href="glossary#hierarchical-cluster-analysis">hierarchical</a> and <a data-display-content-preview-box="true" href="glossary#partitioning-cluster-analysis">partitioning</a>. <a data-display-content-preview-box="true" href="hierarchical-clustering">Hierarchical clustering</a> attempts to divide documents into a branching tree of groups and sub-groups, whereas partitioning methods attempt to assign documents to a pre-designated number of clusters. Clustering methods may also be described as <a data-display-content-preview-box="true" href="glossary#exclusive-cluster-analysis">exclusive</a>, generating clusters in which no document can belong to more than one cluster, or <a data-display-content-preview-box="true" href="glossary#overlapping-cluster-analysis">overlapping</a>, in which documents may belong to multiple clusters. 
Hierarchical methods allow <em>clusters</em> to belong to other clusters, whereas <a data-display-content-preview-box="true" href="glossary#flat-cluster-analysis">flat</a> methods do not allow for this possibility.<i> Lexos</i> implements two types of cluster analysis: a form of hierarchical clustering called <a data-display-content-preview-box="true" href="glossary#agglomerative-hierarchical-clustering">agglomerative hierarchical clustering</a> and a form of flat partitioning called <a data-display-content-preview-box="true" href="glossary#k-means-clustering">K-Means</a>.</p><p>While a full discussion of the many types of cluster analysis is beyond the scope of this work, we may note two other methods that are commonly used in literary text analysis. One such method is topic modeling, which generates clusters of words that appear in close proximity to each other within the corpus. The most popular tool for topic modeling is <a href="http://mallet.cs.umass.edu/">Mallet</a>, and the Lexos Multicloud tool allows you to generate topic clouds of the resulting clusters from Mallet data. Another type of clustering is often referred to as <a href="https://en.wikipedia.org/wiki/Community_structure">community detection</a>, where algorithms are used to identify clusters of related nodes in a network. More information about community detection in network data can be found here [needs a link].</p><h2>Strengths and Limitations of Cluster Analysis</h2><p>Cluster analysis has been put to good use for a variety of purposes. Some of the most successful work using cluster analysis has been in the areas of authorship attribution, detection of collaboration, source study, and translation (REFs). Above all, it can be a useful provocation, helping to focus enquiry on texts or sections of texts that we might otherwise ignore.</p><p>It is important to be aware that there is not a decisive body of evidence supporting the superiority of one clustering method over another. Different methods often produce very different results. There is also a fundamental circularity to cluster analysis in that it seeks to discover structure in data by imposing structure on it. While these considerations may urge caution, it is equally important to remember that they have analogues in traditional forms of analysis and interpretation.</p><p>Regardless of which method is chosen, we should be wary of putting too much stock in any single result. Cluster analysis is most useful when it is repeated many times with slight variations to determine which results are most useful. One of the central concerns will always be the <strong>validity</strong> of algorithmically produced clusters. In part, this is a question of statistical validity based on the nature of the texts used and the particular implementation chosen for clustering. There is ongoing research on what statistical criteria make a cluster a &quot;good cluster&quot;--and how to learn that for a given data set--but there is very little consensus that is of practical use for textual analysis. In Lexos, we include a statistical measure called the <a data-display-content-preview-box="true" href="glossary#silhouette-score">Silhouette Score</a>, which gives a general indication of how well documents lie within their clusters. Silhouette scores do not rely on knowing class labels beforehand.&nbsp;However, a high or low Silhouette Score should not be taken to mean that the clustering is better or worse. It is merely one of many possible measures we could use. 
[See http://blog.data-miners.com/2011/03/cluster-silhouettes.html for further information on Silhouette Scores.] For further discussion, see <a data-display-content-preview-box="true" href="establishing-robust-clusters">Establishing Robust Clusters</a>.</p><p>There is also the more fundamental question of whether similarity as measured by distance metrics corresponds to similarity as apprehended by the human psyche or similarity in terms of the historical circumstances that produced the texts under examination. One of the frequent complaints about cluster analysis is that, in reducing the dimensionality of the documents within the cluster, it occludes access to the content&mdash;especially the semantics of the content&mdash;responsible for the grouping of documents. Point taken. Treating documents as vectors of term frequencies ignores information, but the success of distance measures on n-dimensional vectors of term counts is clear: cluster analysis continues to support exploration that helps define next steps, for example, new ways of segmenting a set of old texts.</p><h2 id="sources">Sources:</h2><p><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.4480&amp;rep=rep1&amp;type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.4480&amp;rep=rep1&amp;type=pdf</a></p><p><a href="http://www.mimuw.edu.pl/~son/datamining/DM/8-Clustering_overview_son.pdf">http://www.mimuw.edu.pl/~son/datamining/DM/8-Clustering_overview_son.pdf</a></p><p><a href="http://www.daniel-wiechmann.eu/downloads/cluster_%20analysis.pdf">http://www.daniel-wiechmann.eu/downloads/cluster_%20analysis.pdf</a></p><p><a href="http://www.stat.wmich.edu/wang/561/classnotes/Grouping/Cluster.pdf">http://www.stat.wmich.edu/wang/561/classnotes/Grouping/Cluster.pdf</a></p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-19T16:34:39+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:838227"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/cluster-analysis"/>
<dcterms:references rdf:resource="http://scalar.usc.edu/works/lexos/media/cluster-analysis-chart"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
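<!--
A minimal sketch reproducing the toy example on the Cluster Analysis page above, assuming numpy and scipy are available (not the Lexos implementation): the small document-term matrix for Documents A, B, and C is converted into a full Euclidean distance matrix, from which A and B emerge as the closest pair.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows: Documents A, B, C; columns: counts of "man" and "woman".
dtm = np.array([[5, 4],
                [4, 5],
                [1, 3]])

# Pairwise Euclidean distances, arranged as a square distance matrix.
dist = squareform(pdist(dtm, metric="euclidean"))
print(np.round(dist, 2))
# A to B is about 1.41, A to C about 4.12, B to C about 3.61,
# so A and B cluster together before C joins.
-->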
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/media/cluster-analysis-chart">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Media"/>
<scalar:isLive>1</scalar:isLive>
<art:thumbnail rdf:resource="http://scalar.usc.edu/works/lexos/media/ClusterChart1_thumb.PNG"/>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-03T20:05:23+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:170894"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/media/cluster-analysis-chart.1"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/media/cluster-analysis-chart.1"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/media/cluster-analysis-chart.1">
<ov:versionnumber>1</ov:versionnumber>
<dcterms:title>Cluster Analysis Chart</dcterms:title>
<dcterms:description>Illustrates document similarity</dcterms:description>
<art:url rdf:resource="http://scalar.usc.edu/works/lexos/media/ClusterChart1.PNG"/>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-03T20:05:23+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:427078"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/media/cluster-analysis-chart"/>
<dcterms:isReferencedBy rdf:resource="http://scalar.usc.edu/works/lexos/cluster-analysis"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:838227:839333:1">
<scalar:urn rdf:resource="urn:scalar:path:838227:839333:1"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/cluster-analysis.35"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical-clustering.17#index=1"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/hierarchical-clustering">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-09T20:18:09+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:172202"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical-clustering.17"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical-clustering.17"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/hierarchical-clustering.17">
<ov:versionnumber>17</ov:versionnumber>
<dcterms:title>Hierarchical Clustering</dcterms:title>
<dcterms:description>Manual page for the Lexos Hierarchical Clustering tool</dcterms:description>
<sioc:content>Hierarchical cluster analysis is a good first choice when asking new questions about texts. Our experience has shown that this approach is remarkably versatile (REF). Perhaps more than any one individual method, the results from our cluster analyses continue to generate new, interesting, and focused questions.<p>Hierarchical clustering does not require you to choose the number of clusters to begin with. A dendrogram, a visual representation of the clusters, can be built by two methods. <a data-display-content-preview-box="true" href="glossary#divisive-hierarchical-clustering">Divisive hierarchical clustering</a> begins with only one cluster (consisting of all documents) and proceeds to cut it into separate &ldquo;sub-clusters&rdquo;, repeating the process until the criterion for dividing them has been exhausted. Alternately,&nbsp;<a data-display-content-preview-box="true" href="glossary#agglomerative-hierarchical-clustering">agglomerative hierarchical clustering</a> begins with every document as its own cluster and then proceeds to assign these items to &ldquo;super-clusters&rdquo; based on the selected <a data-display-content-preview-box="true" href="glossary#distance-metric">distance metric</a> and <a data-display-content-preview-box="true" href="glossary#linkage">linkage</a> criteria (see below). Lexos offers a tool for performing agglomerative hierarchical clustering.</p><p>The clusters that result from hierarchical clustering are typically visualized with a two-dimensional tree diagram called a <a data-display-content-preview-box="true" href="http://scalar.usc.edu/works/lexos/glossary#dendrogram">dendrogram</a>. [There probably needs to be a short summary of the terms used in the video here (clade, leaf, simplicifolious, etc.)--in case people don&#39;t want to watch the video.] For more information about the construction and interpretation of dendrograms in this method, see the video below:</p>&nbsp;<p><a class="inline" resource="how-to-read-a-dendrogram" href="https://www.youtube.com/watch?v=MX6AUX1b1w0"></a></p>&nbsp;<p>Since the resulting tree technically contains clusters at multiple levels, the result of the cluster analysis is obtained by &ldquo;cutting&rdquo; the tree at the desired level. Each connected component then forms a cluster for interpretation.</p><p>The results of hierarchical clustering and the topography of the resulting dendrogram may vary depending on the <a data-display-content-preview-box="true" href="glossary#distance-metric">distance metric</a>, the&nbsp;<a data-display-content-preview-box="true" href="glossary#linkage">linkage</a> criterion used to form the clusters, and other factors such as tokenization and the number of most frequent words used. The distance metric is the measure used for defining what constitutes document similarity, how &quot;far&quot; (distance) one document is from another. The linkage criterion specifies which distances between documents are used to define how similar a document is to a previously formed cluster; the individual types of linkage described below illustrate what is involved.</p><p>Hierarchical clustering presents the user with three main challenges:</p><ol><li>Which distance metric to use.</li><li>What type of linkage criterion to select.</li><li>Where to cut the tree.</li></ol><p>Each of these challenges will be considered in turn.</p><h2>Selecting a Distance Metric</h2><p>This is one of the least well-understood (and least well-documented) aspects of the hierarchical clustering method. Since we are representing texts as document vectors, it makes sense to define document similarity by comparing the two vectors. One way to do this is to select points (terms) on the vectors of two documents and measure the distance between them. If the two vectors are visualized as lines in a triangle, the hypotenuse between these lines can be used as a measure of the distance between the two documents. This standard means of measuring how far apart two documents are is known as <a data-display-content-preview-box="true" href="glossary#euclidean-distance">Euclidean distance</a>. Euclidean distance can be calculated using the square root of the sum of the squares of the differences between corresponding coordinates of points on the document vectors. Despite this mouthful, Euclidean distance is an excellent metric to begin with (and we have had good success with it). Non-Euclidean methods are also possible. For instance, another commonly used measure is <a data-display-content-preview-box="true" href="glossary#cosine-similarity">cosine similarity</a>, which relates the distance between the two documents to the angle between their two vectors. While Euclidean distance will vary depending on which points on the vector are used to calculate the distance, the angle between the vectors does not change. Both of these measures are good starting points. Another is Squared Euclidean distance. This is the same as the Euclidean distance, but it does not take the square root as the final part of the calculation. Because it omits this extra step, Squared Euclidean distance can be a good choice for larger data sets that take longer to process. Lexos provides a variety of options for use as distance metrics. Further discussion can be found under <a data-display-content-preview-box="true" href="choosing-a-distance-metric">Choosing a Distance Metric</a>.</p><h2>Choosing a Linkage Method</h2><p>The second choice that must be made before running a clustering algorithm is the linkage method. At each stage of the clustering process, a choice must be made about whether two clusters should be joined (and recall that a single document itself forms a cluster at the lowest level of the hierarchy). An intuitive means for doing this is to join the cluster containing a point (e.g., a term frequency) closest to the current cluster. This is known as <a data-display-content-preview-box="true" href="glossary#single-linkage">single linkage</a>, which joins clusters based on only a single point. Single linkage does not take into account the rest of the points in the cluster, and the resulting dendrograms tend to have spread-out clusters; this process is called &quot;chaining&quot;. <a data-display-content-preview-box="true" href="glossary#complete-linkage">Complete linkage</a> uses the opposite approach. It takes the two points furthest apart between the current cluster and the others. The cluster with the shortest distance to the current cluster is joined to it. 
Complete linkage thus takes into account all the points on the vector that come before the one with the maximum distance. It tends to produce compact, evenly distributed clusters in the resulting dendrograms. <a data-display-content-preview-box="true" href="glossary#average-linkage">Average linkage</a> is a compromise between single and complete linkage. It takes the average distance of all the points in each cluster and uses the shortest average distance for deciding which cluster should be joined to the current one. We have had good success with average linkage. The <a data-display-content-preview-box="true" href="glossary#weigted-linkage">weighted average</a> linkage performs the average linkage calculation but weights the distances based on the number of terms in the cluster. It may therefore be a good option when there is significant variation in the size of the documents under examination. Another commonly used form of linkage (not currently available in Lexos) is <a data-display-content-preview-box="true" href="glossary#wards-criterion">Ward&#39;s criterion</a>, which attempts to minimize the differences in cluster size as the dendrogram is built. It may not be appropriate for use with documents of variable size. [I have found these concise but comprehensible accounts mostly at http://academic.reed.edu/psychology/stata/analyses/advanced/agglomerative.html, but I have modified the phrasing.] Visualizations of the differences between the linkage criteria can be seen <a href="http://www.molmine.com/help/algorithms/linkage.htm">here</a>. [We should probably look for or make an example with better graphics.] Which linkage criterion you choose depends greatly on the variability of your data and your expectations of its likely cluster structure. The fact that it is very difficult to predict this in advance may explain why the &quot;compromise&quot; of average linkage has proved successful for us.</p><h2>Cutting the Dendrogram</h2><p>Once the dendrogram has been generated, every document leaf will form its own cluster and all documents will belong to a single cluster at the root. In between, there may be any number of clusters formed at differing levels of the hierarchy. Not all of these clusters will necessarily be meaningful. For example, if you are trying to test the authorship of Shakespearean plays, it may not be significant that <i>Macbeth</i> and <i>A Midsummer Night&#39;s Dream</i> fall within the same cluster. It will be more interesting if a Renaissance play we do not know to be by Shakespeare falls within a cluster containing the above plays and not into clusters containing plays by other authors. On the other hand, if we are interested in the question of genre, we might be very interested to know whether <i>Richard II</i>, normally considered a history play, clusters with the tragedy of <i>Macbeth</i> or the comedy of <i>A Midsummer Night&#39;s Dream</i>. In practice, these sorts of considerations will cause us to draw a line on the dendrogram (often at a particular branch height) below which we will not consider clusters significant. This is known as cutting the dendrogram. Where to draw the line can be an impressionistic exercise. Like our choice of linkage, it will depend a great deal on our expectations of our data. Lexos provides two methods of aiding us. Lexos automatically cuts the tree at a threshold set to 70% of the maximum distance at which any two clusters are merged. All connected nodes below this threshold will be given a common color. 
<p>Note that this is a default behavior and may therefore not be entirely appropriate for your material. Second, Lexos allows you to &quot;prune&quot; your dendrogram by restricting the number of leaves displayed. The primary goal of this option is to prevent overlapping labels in dendrograms containing many documents, but it can also help you to identify the most appropriate level at which to cut your dendrogram.</p><p>It should be clear from the above that interpreting dendrograms requires both an understanding of the choices made in implementation and an understanding of the content of the materials being clustered. Furthermore, the structure of the dendrogram and its interpretation are highly dependent on our expectations about the texts we are studying. This epistemological loop is well known in the Humanities, where it is taken for granted that one&#39;s perspective and biases influence interpretation. In hierarchical cluster analysis, the decision-making required for implementation builds these limitations into the method, but hopefully calls attention to them as well.</p><h2>Further Considerations</h2><p>We end with some miscellaneous issues you should be aware of in choosing hierarchical clustering as a method. First, it does not scale well. If you have a large number of documents, or large documents, the number of computations can be a strain on a computer&#39;s processing power. We have not yet established a threshold at which this becomes problematic (especially since it will vary on different machines), but, if you appear to be encountering this problem, trying a simpler distance metric like squared Euclidean may help. If you do manage to produce a dendrogram with a large number of leaves, you may have trouble reading it because the leaf labels overlap. In Lexos, limiting the number of leaves displayed may help.</p><p>These are largely practical issues, but there are also some conceptual ones. In hierarchical clustering, all items (documents and the terms they contain) are forced into clusters, a scenario that may not accurately reflect the relationships of the original texts. Another issue is that hierarchical clustering assigns documents to clusters early in the process and has no method for undoing that partitioning based on data it encounters later. If this appears to be a problem, we suggest trying K-Means clustering, which adjusts cluster membership at each step.</p><p>Statisticians have identified many strengths and shortcomings of hierarchical clustering as a method, and there is ongoing research on the most appropriate distance measures and linkage criteria (much of it using data unlike that employed in literary text analysis). In our test cases, we have typically found that the Euclidean metric with average linkage provides good results. However, Lexos allows you, even encourages you, to apply a number of algorithms and compare the results. This may be one method of establishing whether a particular clustering is valuable. See further <a data-display-content-preview-box="true" href="establishing-robust-clusters">Establishing Robust Clusters</a>.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-22T14:11:12+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:839333"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical-clustering"/>
<dcterms:references rdf:resource="http://scalar.usc.edu/works/lexos/how-to-read-a-dendrogram"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:838227:656924:2">
<scalar:urn rdf:resource="urn:scalar:path:838227:656924:2"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/cluster-analysis.35"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/k-means-clustering.11#index=2"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/k-means-clustering">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-11T22:00:21+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:172448"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/k-means-clustering.11"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/k-means-clustering.11"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/k-means-clustering.11">
<ov:versionnumber>11</ov:versionnumber>
<dcterms:title>K-Means Clustering</dcterms:title>
<dcterms:description>The main overview page for K-means clustering</dcterms:description>
<sioc:content><h2 id="k-means-clustering">K-Means Clustering</h2><p>K-Means clustering partitions a set of documents into a number of groups or clusters in a way that minimizes the variation within clusters. The &quot;K&quot; refers to the number of partitions, so, for example, if you wish to see how your documents might cluster into three groups, you would set K=3. K-Means thus works by minimizing the variation within clusters rather than by computing pairwise distances between documents. Unlike hierarchical clustering, K-Means clustering requires us to choose the number of clusters (K) we wish to produce, but we do not need to choose a distance metric (standard K-means implicitly uses Euclidean distance). As a result, K-Means can be a good alternative to hierarchical clustering for large data sets since it is less computationally intensive.</p><p>When thinking of K-means clustering, we recommend that you think of each of your documents as represented by a single (x,y) point on a two-dimensional coordinate plane. In this view, a cluster is a collection of documents (points) that are close to one another and together form a group. Assigning documents to a specific cluster amounts to determining which cluster &quot;center&quot; is closest to your document.</p><p>The <strong>algorithm</strong> (general procedure or &quot;recipe&quot;) for applying K-means to your collection of documents is described next; a brief code sketch follows the list. The overall goal is to partition your documents into K non-empty subsets.</p><ol><li>Decide on the number of clusters you wish to form. So yes, <em>you</em> must pick a value for K <em>a priori</em>.</li><li>The algorithm will compute a &quot;center&quot; or centroid for each cluster. The centroid is the center (mean point) of a cluster. The procedure for creating centroids at the very start can be varied and is discussed below.</li><li>Assign each of your documents to the cluster with the nearest centroid.</li><li>Repeat steps 2 and 3, thereby re-calculating the locations of the centroids for the documents in each cluster and reassigning documents to the cluster with the closest center. The algorithm continues until no documents are reassigned to different clusters.</li></ol>
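<p>As a concrete illustration of these steps, here is a minimal sketch using scikit-learn (the library whose K-means defaults are discussed under the settings below); the four-document matrix is invented:</p><pre><code>from sklearn.cluster import KMeans

# an invented document-term matrix: one row of term counts per document
dtm = [[4, 0, 2, 1],
       [2, 1, 3, 0],
       [0, 5, 1, 2],
       [1, 4, 0, 2]]

# K must be chosen in advance; k-means++ is the default seeding method
km = KMeans(n_clusters=2, init=&quot;k-means++&quot;, n_init=10, max_iter=300)
labels = km.fit_predict(dtm)  # the cluster index assigned to each document
print(labels)                 # e.g. [0 1 1 0]: the grouping, not the numbering, matters
</code></pre>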
<p><strong>Required Settings:</strong></p><p><strong>K value</strong>: There is no obvious way to choose the number of clusters. It can be helpful to perform hierarchical clustering before performing K-Means clustering, as the resulting dendrogram may suggest a certain number of clusters that is likely to produce meaningful results. The K-means procedure is very sensitive to the position of the initial seeds, although employing the K-means++ setting can help to constrain this placement.</p><p><strong>Method of Visualization:</strong><br />As mentioned earlier, K-Means clustering is generally visualized on a two-dimensional plane, with the distance between cluster members (documents) indicated by their coordinates. Convex polygons known as Voronoi cells may be drawn around the cluster centroids to indicate which documents fall in which clusters. Another way of visualizing the results of K-Means clustering is with Principal Component Analysis (PCA), where dots on the plane are colored to mark their cluster membership. Both visualization approaches can help you judge distances between clusters.</p><p><strong>Advanced Settings:</strong><br />Since cluster membership is adjusted at each stage of the process by the re-location of the centroids, the number of iterations required and other factors can be adjusted to select a cutoff point for the algorithm or a desired threshold for the convergence of different clusters. As with the initial choice of cluster numbers, there are no hard and fast rules for how these factors should be applied. For most users, we strongly recommend the default settings: the user need not enter or change any of these settings, and <em>Lexos</em> will apply the default values.</p><p><strong>Maximum Number of Iterations:</strong><br />As noted above, the K-means algorithm will continue to re-compute centroids for each cluster until all documents settle into their final clusters. It is possible for a document to continue to toggle back and forth between two clusters. This setting prevents an endless, or at least an unnecessary, number of iterations that produce little change.</p><p><strong>Method of Initialization:</strong><br />Your results from using K-means on a collection of documents can vary significantly depending on the <em>initial</em> choice of centroids. In <em>Lexos</em>, the user is offered two choices: K-Means++ and Random. When using K-Means++, the default setting in <em>Lexos</em>, the center of the first of the K clusters is chosen at random (typically by picking any one of the documents in the starting set as representative of the center of a future cluster). The remaining (K-1) cluster centers are then chosen from the remaining documents with probability proportional to their distance from the centers already chosen. Once all centroids are chosen, normal K-Means clustering takes place. With the Random option, the locations of all centroids at the initial stage are generated randomly; it is best to experiment multiple times with different random seeds.</p><p><strong>Number of Iterations with Different Centroids:</strong> Given the sensitivity of the final clusters to the choice of initial centroids, <em>Lexos</em> uses a default setting of running <strong>?? (note: scikit-learn says N=10; we have it set at 300?)</strong> trials, each trial using different centroid starting locations (or seeds).</p><p><strong>Relative Tolerance:</strong><br />This setting allows an expert user to vary the rate of convergence of the algorithm.</p><p>The reliability of K-Means clustering can be evaluated by many statistical procedures. Lexos provides one criterion, the Silhouette Score, which is also used to evaluate the reliability of results following hierarchical clustering. See <a data-display-content-preview-box="true" href="cluster-analysis">Cluster Analysis</a> for further discussion.</p>
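<p>For readers who wish to compute this measure outside Lexos, scikit-learn exposes it directly; the document-term matrix below is invented:</p><pre><code>from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

dtm = [[4, 0, 2, 1], [2, 1, 3, 0], [0, 5, 1, 2], [1, 4, 0, 2]]
labels = KMeans(n_clusters=2, n_init=10).fit_predict(dtm)

# close to 1: tight, well-separated clusters; near 0: overlapping clusters;
# negative: documents have probably been assigned to the wrong cluster
print(silhouette_score(dtm, labels))
</code></pre>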
</sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3689"/>
<dcterms:created>2016-03-12T12:32:09+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:656924"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/k-means-clustering"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:838227:431807:3">
<scalar:urn rdf:resource="urn:scalar:path:838227:431807:3"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/cluster-analysis.35"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/choosing-a-distance-metric.5#index=3"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:838227:838304:4">
<scalar:urn rdf:resource="urn:scalar:path:838227:838304:4"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/cluster-analysis.35"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/establishing-robust-clusters.4#index=4"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/establishing-robust-clusters">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-12T00:21:10+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:172452"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/establishing-robust-clusters.4"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/establishing-robust-clusters.4"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/establishing-robust-clusters.4">
<ov:versionnumber>4</ov:versionnumber>
<dcterms:title>Establishing Robust Clusters</dcterms:title>
<dcterms:description>Detailed discussion of how to handle cluster robustness</dcterms:description>
<sioc:content>One of the most vexing questions in the use of cluster analysis for computational stylistics is how we distinguish &quot;good&quot; clusters from clusters that are mere &quot;noise&quot;, whether generated by our data or by our choice of implementations. Ideally, we want to generate &quot;robust&quot; clusters, by which we mean that they stand up to some measure of scrutiny. We can define this in many ways. If we cut several documents into segments and the individual segments of each document cluster together in opposition to the segments of other documents, we can assume that the clustering process has captured something meaningful, if only the distinctiveness of our original documents. When less predictable effects occur&mdash;say, one segment clusters with the &quot;wrong&quot; document&mdash;we have to conclude either that there is something sub-optimal about our clustering procedure or that we have found something really interesting. Thus our intuitive sense of &quot;surprise&quot; at our results may be a sign of a weak clustering, but this &quot;surprise&quot; is also the goal of our analysis&mdash;within reason. Below we discuss some methods for deciding when unexpected clusterings merit interpretation and how we can be relatively sure that our clusters&mdash;and thus our conclusions based on them&mdash;are robust.<br /><br />The Holy Grail for some would be a statistical measure with which to assess the &quot;validity&quot; of our clusters. A number of such measures exist, but their usefulness for a wide variety of data, and for the types of questions humanists typically ask of their data, remains an open question. Lexos offers one measure, the <a href="silhouette-scores">Silhouette Score</a>, which attempts to quantify our confidence that individual documents have been assigned to the &quot;correct&quot; cluster. However, we recommend that you integrate non-statistical approaches into your workflow. Creating a number of different cluster analyses with slightly different settings to see how well the clusters hold up to these &quot;tweaks&quot; is probably the most reliable way to establish confidence in your clusters. Drout et al. have outlined a variety of procedures in <a target="_blank" href="http://www.palgrave.com/us/book/9783319306278">Beowulf Unlocked: New Evidence from Lexomic Analysis (2016)</a>.
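<br /><br />As a simple illustration of this &quot;tweak and compare&quot; approach (a sketch using scikit-learn, not a feature of Lexos itself), the adjusted Rand index quantifies how closely two cluster analyses of the same documents agree, regardless of how the cluster labels happen to be numbered:<br /><pre><code>from sklearn.metrics import adjusted_rand_score

# invented cluster assignments for the same six documents under two
# different settings (say, Euclidean versus cosine distance)
run_a = [0, 0, 1, 1, 2, 2]
run_b = [1, 1, 0, 0, 2, 2]

# 1.0 means the two runs group the documents identically;
# values near 0 mean agreement is no better than chance
print(adjusted_rand_score(run_a, run_b))  # prints 1.0
</code></pre></sioc:content>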
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-19T17:45:48+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:838304"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/establishing-robust-clusters"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/cut">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-02T07:38:01+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:158924"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/cut.7"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/cut.7"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/cut.7">
<ov:versionnumber>7</ov:versionnumber>
<dcterms:title>The Cutter Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos Cutter tool</dcterms:description>
<sioc:content><p>The Lexos <strong>Cutter</strong> tool allows you to divide your texts into multiple segments. Each segment is treated by Lexos exactly like any other document. You can perform individual scrubbing actions, create word clouds of segments, and cluster the segments of documents just as you would any other text.</p><h3>Cutting Options</h3><p>Lexos gives you numerous options for designating where documents should be cut into segments. The options are detailed below; a short code sketch of their shared segmentation logic follows the last of them.</p><h4><u>Characters/Segment</u></h4><p>This option allows you to designate the number of characters you wish to be included in each segment. When the <strong>Characters/Segment</strong> radio button is clicked, the <strong>Segment Size</strong>, <strong>Overlap</strong>, and <strong>Last Segment Size Threshold</strong> options become visible. <strong>Segment Size</strong> refers to the number of characters you wish to include in each segment: Lexos will end a segment once it reaches this number of characters and then begin the next segment. <strong>Overlap</strong> allows you to specify an area of overlap between segments. For instance, if you choose a segment size of 1000 characters and an overlap of 10 characters, Segment 1 will end at character 1000 and Segment 2 will begin at character 990. The <strong>Last Segment Size Threshold</strong> option provides a method of handling circumstances where the final segment does not reach the designated segment size. The default setting is to treat this final segment as a separate segment if it is 50% or more of the designated segment size; if not, the entire final segment will be attached to the previous one. Changing the <strong>Last Segment Size Threshold</strong> percentage allows you to customize this behavior.</p><h4><u>Lines/Segment</u></h4><p>If your documents contain line breaks, you may use them to indicate where Lexos performs cutting actions. The <strong>Segment Size</strong> option allows you to choose the number of lines after which Lexos will perform a cut. All the other options work exactly the same as for the <strong>Characters/Segment</strong> option, except that they work by counting lines instead of characters.</p><h4><u>Tokens/Segment</u></h4><p>Lexos can perform cutting actions based on the number of tokens per segment. By default, it treats space-separated strings of characters as tokens, but this behavior can be modified by changing the settings in the <strong>Tokenizer</strong> tool. This will allow you to use n-grams as your tokens. Apart from using tokens as the unit for measuring segment size, all other options work exactly the same as for the <strong>Characters/Segment</strong> option.</p><h4><u>Segments/Document</u></h4><p>This option divides documents into a designated number of evenly-sized segments, regardless of the length of the document. Where the last segment is shorter than the others, Lexos applies a 50% <strong>Last Segment Size Threshold</strong> percentage as described under <strong>Characters/Segment</strong> above.</p>
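<p>The size-based options above share the same underlying logic: fill each segment to the designated size, optionally overlapping with its predecessor, and decide what to do with a short final segment. The sketch below is an illustrative re-implementation of that logic in Python, not the Lexos source code:</p><pre><code>def cut_by_characters(text, size=1000, overlap=0, threshold=0.5):
    # illustrative sketch only -- not the Lexos source code
    step = size - overlap
    starts = list(range(0, len(text), step))
    segments = [text[s:s + size] for s in starts]
    # attach a too-short final segment to the previous one
    if len(segments) &gt; 1 and len(segments[-1]) &lt; size * threshold:
        segments.pop()
        segments[-1] = text[starts[-2]:]  # previous segment extended to the end
    return segments
</code></pre><p>For example, cutting a 2500-character text with a segment size of 1000 yields segments of 1000, 1000, and 500 characters; the 500-character remainder meets the 50% threshold and therefore remains a separate segment.</p>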
<h4><u>Cut by Milestone</u></h4><p>This option allows you to designate a text string occurring in the document for use as a delimiter between segments. Typically, these &ldquo;milestone&rdquo; strings will be placed at appropriate locations in text files before they are uploaded to Lexos. For instance, you might add the string &ldquo;CHAPTER&rdquo; at the beginning of every chapter in a novel and then supply &ldquo;CHAPTER&rdquo; as the milestone term. Lexos will then perform a cut every time it encounters this term, allowing you to divide your novel into individual documents for each chapter. Note that you must be careful to select a milestone term that does not otherwise occur in the text of your documents. Milestones are not counted as terms in the Document-Term Matrix (DTM).</p><h3>Cutting your Documents</h3><p>Once you have selected the cutting options you desire, click the <strong>Preview Cuts</strong> button to see the results in the preview window. If you are happy with the cuts performed by Lexos, click the <strong>Apply Cuts</strong> button. This will create new documents with the same name as the original, followed by a number for each segment. Each segment will appear as a new document in the <strong>Manage</strong> tool. Once cutting is applied, the original document is deactivated and the new segments are made active documents. In addition, once cuts are applied, each segment acquires an <strong>Individual Options</strong> button in the preview window. Clicking this button opens a version of the cutting options form from the main Cutter tool, which allows you to apply cuts to each segment individually.</p><p>You can download the new document segments by clicking the <strong>Download Cut Files</strong> button.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<scalar:continue_to_content_id>158924</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-13T18:35:26+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:831957"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/cut"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/cutting">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-19T01:37:40+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:174492"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/cutting.2"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/cutting.2"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/cutting.2">
<ov:versionnumber>2</ov:versionnumber>
<dcterms:title>Cutting</dcterms:title>
<dcterms:description>The main starting page for Cutting topics</dcterms:description>
<sioc:content>Cutting topics go here.<br /><br />This path has not yet been developed.</sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-22T17:42:59+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:839398"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/cutting"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/epistemology">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-02-27T13:19:08+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:389750"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/epistemology.1"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/epistemology.1"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/epistemology.1">
<ov:versionnumber>1</ov:versionnumber>
<dcterms:title>Epistemology</dcterms:title>
<dcterms:description>The beginning of a thread on interpreting the results of computational text analysis</dcterms:description>
<sioc:content>This is just the beginning of a thread on interpreting the results of computational text analysis. For now, we&#39;re just posting relevant links.<ul><li><a target="_blank" href="https://zentralwerkstatt.github.io/index.html?post=post_vsm_new">Fabian Offert, &quot;Intuition and Epistemology of High-Dimensional Vector Space 1: Solving is Visualizing.&quot; Zentralwerkstatt (February 22, 2017)</a>.</li></ul></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-02-27T13:19:08+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1044728"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/epistemology"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/glossary">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-11T10:03:11+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:160761"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/glossary.15"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/glossary.15"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/glossary.15">
<ov:versionnumber>15</ov:versionnumber>
<dcterms:title>Glossary</dcterms:title>
<dcterms:description>Glossary of terms used in Lexos and In the Margins</dcterms:description>
<sioc:content><p>This page is intended to provide definitions for the terms used within the Lexos suite, as well as to disambiguate terms drawn from natural language, programming languages, and linguistic analysis. New entries are being added on an ongoing basis.</p><p><a name="agglomerative-hierarchical-clustering"></a> <strong><u>Agglomerative Hierarchical Clustering</u></strong></p><p><a name="character"></a> <strong><u>Character</u></strong></p><p>A character is any individual symbol. The letters that make up the Roman alphabet are characters, as are non-alphabetic symbols such as the Hanzi used in Chinese writing. In Lexos, the term <em>character</em> generally refers to countable symbols.</p><p><a name="community-detection"></a> <strong><u>Community Detection</u></strong></p><p><a name="cosine-similarity"></a> <strong><u>Cosine Similarity</u></strong></p><p><a name="cutting"></a> <strong><u>Cutting</u></strong></p><p><a name="dendrogram"></a> <strong><u>Dendrogram</u></strong></p><p><a name="dimensionality-reduction"></a> <strong><u>Dimensionality Reduction</u></strong></p><p><a name="distance-metric"></a> <strong><u>Distance Metric</u></strong></p><p><a name="document"></a> <strong><u>Document</u></strong></p><p>In Lexos, a document is any collection of words (known as terms in Lexos) or characters collected together to form a single item within the Lexos tool. A document is distinct from a file in that the term document refers specifically to the items manipulated within the Lexos software suite, as opposed to file, which refers to the items that are either uploaded from or downloaded to a user&rsquo;s device.</p><p><a name="edit-distance"></a> <strong><u>Edit Distance</u></strong></p><p><a name="euclidean-distance"></a> <strong><u>Euclidean Distance</u></strong></p><p><a name="exclusive-cluster-analysis"></a> <strong><u>Exclusive Cluster Analysis</u></strong></p><p><a name="feature-selection"></a> <strong><u>Feature Selection</u></strong></p><p><a name="file"></a> <strong><u>File</u></strong></p><p>File refers to items that can be manipulated through the file manager on a user&rsquo;s computer (e.g., Windows Explorer, an archive manager, etc.). File is only used in the Lexos suite when referring to functions that involve the user&rsquo;s file system, such as uploading or downloading.</p><p><a name="flat-cluster-analysis"></a> <strong><u>Flat Cluster Analysis</u></strong></p><p><a name="hapax-legomena"></a> <strong><u><em>Hapax Legomena</em></u></strong></p><p>A term occurring only once in a document or corpus.</p><p><a name="hierarchical-cluster-analysis"></a> <strong><u>Hierarchical Cluster Analysis</u></strong></p><p><a name="k-means-clustering"></a> <strong><u>K-Means Clustering</u></strong></p><p><a name="lemma"></a> <strong><u>Lemma</u></strong></p><p>The dictionary headword form of a word. For instance, &ldquo;cat&rdquo; is the lemma for &ldquo;cat&rdquo;, &ldquo;cats&rdquo;, &ldquo;cat&rsquo;s&rdquo;, and &ldquo;cats&rsquo;&rdquo;. 
Lemmas are generally used to consolidate grammatical variations of the same word as a single term, but they may also be used for spelling variants.</p><p><a name="lexomics"></a> <strong><u>Lexomics</u></strong></p><p>The term &ldquo;lexomics&rdquo; was originally used to describe the computer-assisted detection of &ldquo;words&rdquo; (short sequences of bases) in genomes,<sup><a href="http://www.jstor.org/stable/10.1086/668252#fn15">*</a></sup> but we have extended it to apply to literature, where lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. Using statistical methods and computer-based tools to analyze data retrieved from electronic corpora, lexomic analysis allows us to identify patterns of vocabulary use that are too subtle or diffuse to be perceived easily. We then use the results derived from statistical and computer-based analysis to augment traditional literary approaches including close reading, philological analysis, and source study. Lexomics thus combines information processing and analysis with methods developed by medievalists over the past two centuries. We can use traditional methods to identify problems that can be addressed in new ways by lexomics, and we also use the results of lexomic analysis to help us zero in on textual relationships or portions of texts that might not previously have received much attention.</p><p><a name="n-gram"></a> <strong><u>N-gram</u></strong></p><p>An n-gram is a contiguous sequence of tokens of a fixed length <em>n</em>. The tokens can be characters or larger units (e.g., space-bounded strings typically equivalent to words in Western languages). A one-token n-gram is described as a 1-gram or unigram; there are also 2-grams (bigrams), 3-grams (trigrams), 4-grams, and 5-grams. Larger n-grams are rarely used. Using n-grams to create a sliding window of characters in a text is one method of counting terms in non-Western languages (or DNA sequences) where spaces or other markers are not used to delimit token boundaries.</p>
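<p>The sliding-window idea is simple to express in code; this sketch (illustrative only, not Lexos source code) produces character n-grams:</p><pre><code>def char_ngrams(text, n):
    # slide an n-character window across the text, one position at a time
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams(&quot;cats&quot;, 2)  # [&#39;ca&#39;, &#39;at&#39;, &#39;ts&#39;]
</code></pre>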
<p><a name="normalization"></a> <strong><u>Normalization</u></strong></p><p><a name="overlapping-cluster-analysis"></a> <strong><u>Overlapping Cluster Analysis</u></strong></p><p><a name="partitioning-cluster-analysis"></a> <strong><u>Partitioning Cluster Analysis</u></strong></p><p><a name="rolling-window-analysis"></a> <strong><u>Rolling Window Analysis</u></strong></p><p><a name="scrubbing"></a> <strong><u>Scrubbing</u></strong></p><p><a name="segment"></a> <strong><u>Segment</u></strong></p><p>After cutting a text in Lexos, the separated pieces of the text are referred to as segments. However, segments are treated by Lexos as documents, and they may be referred to as documents when the focus is not on their being a part of the entire text.</p><p><a name="similarity"></a> <strong><u>Similarity</u></strong></p><p><a name="sparse-matrix"></a> <strong><u>Sparse Matrix</u></strong></p><p><a name="standard-deviation"></a> <strong><u>Standard Deviation</u></strong></p><p><a name="standard-error-test"></a> <strong><u>Standard Error Test</u></strong></p><p><a name="stopword"></a> <strong><u>Stopword</u></strong></p><p><a name="supervised-learning"></a> <strong><u>Supervised Learning</u></strong></p><p><a name="term"></a> <strong><u>Term</u></strong></p><p>A term is the unique form of a token. If a <strong>token</strong> &quot;cat&quot; occurs two times in a document, the <strong>term</strong> count for &quot;cat&quot; is 2. In computational linguistics, terms are sometimes called &ldquo;types&rdquo;, but we avoid this usage for consistency.</p><p><a name="text"></a> <strong><u>Text</u></strong></p><p>Text is a general term used to refer to the objects studied in lexomics, irrespective of the form. It thus may refer to either a file or documents, but it is typically used to refer to the whole work, rather than smaller segments.</p><p><a name="token"></a> <strong><u>Token</u></strong></p><p>A token is an individual string of characters that may occur any number of times in a document. Tokens can be characters, words, or n-grams (strings of one or more characters or words).</p><p><a name="tokenization"></a> <strong><u>Tokenization</u></strong></p><p>The process of dividing a text into <em>tokens</em>.</p><p><a name="type"></a> <strong><u>Type</u></strong></p><p>See <strong>term</strong>.</p><p><a name="unicode"></a> <strong><u>Unicode</u></strong></p><p><a name="unsupervised-learning"></a> <strong><u>Unsupervised Learning</u></strong></p><p><a name="word"></a> <strong><u>Word</u></strong></p><p>A word is, in many Western languages, a set of characters bounded by whitespace or punctuation marks, where whitespace refers to one or more spaces, tabs, or new-line inserts. However, to avoid ambiguity when dealing with many non-Western languages such as Chinese, where a single Hanzi character can refer to the equivalent of an entire Western word, <em>term</em> is used throughout the Lexos interface and documentation in place of <em>word</em>. There are a few exceptions where &ldquo;word&rdquo; is used because it is part of an established phrase, it is less awkward, or because the context refers to the semantic category of words.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-22T17:44:56+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:839401"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/glossary"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/handling-entities">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-01-24T21:00:47+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:377329"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/handling-entities.1"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/handling-entities.1"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/handling-entities.1">
<ov:versionnumber>1</ov:versionnumber>
<dcterms:title>Handling Entities</dcterms:title>
<dcterms:description>Instruction for handling HTML, XML, and SGML Entities</dcterms:description>
<sioc:content>Texts in HTML, XML, and SGML typically encode special characters with <a target="_blank" href="https://en.wikipedia.org/wiki/Numeric_character_reference">numeric character references</a>. In these markup languages, entities are typically represented using codes beginning with <code>&amp;</code> and ending with <code>;</code>. These codes may be in decimal or hexadecimal format. For instance, the letter <em>&AElig;</em> may be represented as <code>&amp;#198;</code> (decimal) or <code>&amp;#xC6;</code> (hexadecimal). Additionally, texts in these formats may use <a href="https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references" target="_blank">character entity references</a> such as <code>&amp;AElig;</code>, which can also be used to encode <em>&AElig;</em> in HTML. Collectively, these references are often referred to as &quot;entities&quot;. Web browsers will automatically display the single-character equivalents of these entities if they are part of the HTML standard and/or are available in the display font.<br /><br />By default, the Lexos scrubbing tool leaves character entities alone, but this can lead to unexpected behaviors in combination with the <strong>Remove All Punctuation</strong> option. When that option is applied, an entity like <code>&amp;AElig;</code> will become <code>AElig</code> and may end up looking just like a word to Lexos&#39; counting functions. HTML and XML texts are particularly likely to contain entities like <code>&amp;amp;</code> for &quot;&amp;&quot; or <code>&amp;quot;</code> for quotation marks.<br /><br />If you wish to preserve these entities and still remove punctuation marks, you must convert them to their single-character <a href="https://en.wikipedia.org/wiki/Unicode" target="_blank">Unicode</a> equivalents first. Lexos allows you to do this with the <strong>Special Characters</strong> option. This replaces entities before punctuation marks are stripped, making it safe to remove punctuation.
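<br /><br />Outside Lexos, the same conversion can be performed with Python&#39;s standard library; this minimal sketch shows a named, a decimal, and a hexadecimal reference all unescaping to the same Unicode character:<br /><pre><code>import html

# named, decimal, and hexadecimal references all unescape to the letter &AElig;
print(html.unescape(&quot;&amp;AElig; &amp;#198; &amp;#xC6;&quot;))  # prints: &AElig; &AElig; &AElig;
</code></pre></sioc:content>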
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-01-24T21:00:47+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1013550"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/handling-entities"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/hierarchical">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-19T17:41:21+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:318944"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical.3"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical.3"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/hierarchical.3">
<ov:versionnumber>3</ov:versionnumber>
<dcterms:title>The Hierarchical Clustering Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos Hierarchical Clustering tool</dcterms:description>
<sioc:content><p>The Lexos <strong>Hierarchical Clustering</strong> tool performs hierarchical agglomerative cluster analysis on your active documents and produces a visualization of this analysis in the form of a dendrogram (tree diagram). The most important options are the <strong>Distance Metric</strong> (the method of measuring the distance between documents) and <strong>Linkage Method</strong> (the method of determining when documents will be attached to a cluster) dropdown menus. Lexos uses Euclidean distance and average linkage as defaults. For further details about how to choose a distance metric and linkage method, see the topics discussion on <a href="http://scalar.usc.edu/works/lexos/hierarchical-clustering" target="_blank">Hierarchical Clustering</a>.</p><p>The remaining options allow you to configure the appearance of the dendrogram. You may supply a <strong>Dendrogram Title</strong>, which will be displayed at the top of the graph, and select the <strong>Dendrogram Orientation</strong> (vertical or horizontal). In our experience, vertically-oriented dendrograms are easier to interpret. However, when they have many leaves, the labels tend to overlap and become unreadable; horizontal dendrograms may produce slightly better results. Another approach is to limit the <strong>Number of Leaves</strong> displayed in the dendrogram. Reducing this number will collapse the most closely related clusters (those lower down on the dendrogram), showing only the larger groups. A numbered label in parentheses will show how many leaves have been collapsed into a single branch. See below for other strategies for producing more readable dendrograms.</p><p>The <strong>Show Branch Height in Dendrogram</strong> option will place red nodes at the top of each clade labelled with the height (length) of the clade branches from the leaf node. See the <a resource="how-to-read-a-dendrogram" data-annotations="" data-caption="description" data-align="right" data-size="small" href="https://www.youtube.com/watch?v=MX6AUX1b1w0">How to Read a Dendrogram</a> video for the interpretation of branch height. The <strong>Show Legends in Dendrogram</strong> option will add to the dendrogram image a series of annotations showing the options you have selected.</p><p>All of the <a href="advanced-options">Advanced Options</a> for manipulating the Document-Term Matrix (DTM) are available in the <strong>Hierarchical Clustering</strong> tool. There are also options for generating a <em>Silhouette Score</em>, a measure of cluster robustness. <strong>Silhouette Score Options</strong> are discussed below.</p><p><strong>Important</strong>: Due to a limitation in the <a target="_blank" href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html">scipy clustering package</a> employed by Lexos to plot dendrograms, leaf labels containing non-Roman or other special characters will most likely appear as question marks. If this is the case, we recommend using the <a href="advanced-options">Advanced Options</a> <strong>Temporary Labels</strong> function to ensure that your leaf labels clearly identify your documents. We hope to address this limitation in future versions of Lexos.</p><p>Once you have selected your options, click the <strong>Get Dendrogram</strong> button. 
After the dendrogram appears, you can click on it to open it in a new window.</p><h3>Silhouette Scores</h3><p>Silhouette scores give a general indication of how well individual objects lie within their cluster and are thus one method of <a href="establishing-robust-clusters">measuring cluster robustness</a>. A score of 1 indicates tight, distinct clusters. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.</p><p>To generate a silhouette score for your dendrogram, click on the <strong>Silhouette Score Options</strong> menu. You may set the <strong>Maximum Number of Clusters</strong> to between 2 and the number of active documents in your session. After setting this number, click the green <strong>Get Dendrogram</strong> button, and the silhouette score will appear above the button. Further information can be found in the topics article on <a href="silhouette-scores">Silhouette Scores</a>.</p><h3>Downloading Dendrograms</h3><p>Lexos allows you to download dendrogram images in a number of formats (PDF, PNG, and SVG). To download a dendrogram image, click the appropriate button on the right side of the screen.</p><p>Lexos uses the <a target="_blank" href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html">scipy clustering package</a> to plot dendrograms, and this has some severe limitations in the type of output available. There are many other tools available which allow you to explore and manipulate dendrograms once you have done your cluster analysis. These tools typically allow you to import a pre-existing dendrogram (tree) structure in <a target="_blank" href="https://en.wikipedia.org/wiki/Newick_format">Newick format</a>: a text file representing the hierarchical structure using parentheses and commas. Lexos also provides a <strong>Newick</strong> download button which will convert your dendrogram&#39;s structure to a text file in Newick format. You can then upload this file to external tools. Note, however, that many external dendrogram plotting tools do not seem to preserve branch height.</p>
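<p>For orientation, here is what a small, invented dendrogram looks like in Newick format: nested parentheses record the cluster structure, and the numbers after the colons record branch lengths (heights):</p><pre><code># an invented three-leaf tree in Newick format
newick = &quot;((DocA:0.4,DocB:0.4):0.6,DocC:1.0);&quot;
</code></pre></sioc:content>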
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-01-05T13:11:05+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:999384"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/how-to-read-a-dendrogram">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Media"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-06-16T08:55:01+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:161357"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/how-to-read-a-dendrogram.2"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/how-to-read-a-dendrogram.2"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/how-to-read-a-dendrogram.2">
<ov:versionnumber>2</ov:versionnumber>
<dcterms:title>How to Read a Dendrogram</dcterms:title>
<dcterms:description>YouTube video tutorial of how to read a dendrogram</dcterms:description>
<art:url rdf:resource="https://www.youtube.com/watch?v=MX6AUX1b1w0"/>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-06-16T08:56:32+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:402159"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/how-to-read-a-dendrogram"/>
<dcterms:isReferencedBy rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical-clustering"/>
<dcterms:isReferencedBy rdf:resource="http://scalar.usc.edu/works/lexos/hierarchical"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/how-to-run-lexos">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-23T10:42:30+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:319255"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/how-to-run-lexos.2"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/how-to-run-lexos.2"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/how-to-run-lexos.2">
<ov:versionnumber>2</ov:versionnumber>
<dcterms:title>How to Run Lexos</dcterms:title>
<dcterms:description>Instructions for using Lexos online or on localhost</dcterms:description>
<sioc:content><p>Lexos is a web-based tool designed for transforming, analyzing, and visualizing texts. Lexos is designed for use primarily with small to medium-sized text collections, and especially for use with ancient languages and languages that do not employ the Latin alphabet. Lexos was created as an entry-level platform for Humanities scholars and students new to computational techniques while providing tools and techniques sophisticated enough for advanced research.</p><p>Lexos runs through your web browser. Currently, Lexos supports Google Chrome and Mozilla Firefox; other browsers may not function properly. You may choose either of the following methods of running Lexos:</p><ol><li>Use the online installation hosted by the Lexomics project at <a target="_blank" href="http://lexos.wheatoncollege.edu/">http://lexos.wheatoncollege.edu/</a>. This is very convenient, but you may suffer uploading or processing delays based on fluctuations in internet speed.</li><li><a target="_blank" href="http://wheatoncollege.edu/lexomics/lexos-installers/">Download and Install Lexos</a> using one of the methods provided on the Lexomics website (either use an auto-installer, follow the manual instructions, or clone the GitHub repository). This method requires you to install the Python programming language on your computer. Lexos runs in a &quot;localhost&quot; web server on your machine, which may be faster than communicating with the Lexomics server. Running Lexos on your computer also provides the option to use &quot;local mode&quot;, which does not require internet access (see below).</li></ol><p>Both methods have their advantages and disadvantages. If you are a beginner, we suggest that you get to know Lexos using the online version. Later, you can download Lexos and run it locally for greater speed.</p><h3>Using Local Mode</h3><p>Many functions in Lexos are based on common Javascript libraries like jQuery and Twitter Bootstrap, which are employed all over the internet. Chances are that your browser has already cached these libraries and does not need to load them, which makes loading times much faster, but we cannot rely on this. So, even if you are running Lexos on your own computer using localhost, Lexos still requires an active internet connection to download these Javascript libraries. Most of the time, this is not an issue.</p><p>But what if you don&#39;t have an internet connection? You can still run Lexos locally on your computer. Lexos has all the Javascript libraries built in and will switch to them if you put it in &quot;local mode&quot;. All you have to do is find the Lexos folder on your computer and open the file <code>config.cfg</code> in a text editor. Change <code>LOCAL_MODE = False</code> to <code>LOCAL_MODE = True</code> (be careful, it is case sensitive); then save the file. You can ignore the other settings. If you are already running Lexos, quit it by typing <code>Control+C</code> on the command line and then restart it by typing <code>python lexos.py</code>. (See the <a target="_blank" href="http://wheatoncollege.edu/lexomics/lexos-installers/">Manual Installation instructions</a> on the Lexomics website if you need help with this.) You will now be running in local mode.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-23T15:57:03+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:839689"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/how-to-run-lexos"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/index">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<scalar:banner>media/BeoEthThorn1000WordAve.JPG</scalar:banner>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-02T08:03:42+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:158930"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/index.62"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/index.62"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/index.62">
<ov:versionnumber>62</ov:versionnumber>
<dcterms:title>Welcome</dcterms:title>
<dcterms:description>The In the Margins home page</dcterms:description>
<sioc:content><p><em>In the Margins</em> is a <a href="http://scalar.usc.edu/">Scalar</a> book which serves as a companion for Lexomic research and the <a target="_blank" href="http://lexos.wheatoncollege.edu">Lexos</a> literary text analysis software. The online version of the Lexos software is available at <a href="http://lexos.wheatoncollege.edu/upload">http://lexos.wheatoncollege.edu</a>. Our passions for tool-building have intersected with our interest in two questions:</p><blockquote><em>How can we explore the growing impact that quantitative and algorithmic approaches are having on the Humanities?</em></blockquote><blockquote><em>How can we make the discussion part of the tool and the tool part of the discussion?</em></blockquote><p><em>Lexomics</em> is our name for certain methods of stylistic analysis (sometimes called stylometry). This type of analysis harnesses the power of modern computing and statistical techniques to investigate Humanities-based questions such as authorship attribution or textual lineage. Lexomic methods complement traditional Humanities methods of literary interpretation rather than replacing them. We note that our small but spirited team exists within a much larger community of scholars who continue to influence our team greatly (<em>cf.</em> Eder, Craig, Jockers, Hoover, Liu, Sinclair and Rockwell, <em>et al.</em>).</p><p>The role of Lexos is to help readers of literature identify and explore patterns in texts, thereby opening up new questions and new avenues of research. Lexos provides an integrated workflow of pre-processing, analytical, and visualization tools which allow students and scholars of literature to detect and explore patterns in their texts. <a href="http://lexos.wheatoncollege.edu">Lexos</a> is freely available for use online (perhaps the best choice for first-time and occasional users), and it may also be downloaded and installed locally for better performance (installation instructions are available <a target="_blank" href="https://github.com/WheatonCS/Lexos/tree/master/0_InstallGuides">here</a>).</p><p>The aim of Lexos is to create an entry-level environment for Lexomic scholarship, one simple enough to be used easily by the casual student but powerful enough for the advanced professor to use in creating new knowledge and insight. Lexos was created for use with small to medium-sized collections of texts (rather than large text corpora or &quot;big data&quot;), and for use with languages that have non-standard or non-Latin-based spelling systems. Most of the early Lexomic research was done on medieval English texts. Doing statistical analysis on texts of these types creates certain challenges, both theoretical and practical, and Lexos developed as a way to explore them.</p><p>These issues form part of a wider set of questions we can ask about how computational tools can be used in the Humanities: where are the opportunities, what are the effective practices, and what are the limitations? These questions are not new with us, of course, and the wider field is too large to cite here, but <em>In the Margins</em> is our effort to bring the choice of and discussion about methodological decisions to the fore. Our companion documentation, <em>In the Margins,</em> exists not only as a &quot;how to&quot; guide for using Lexos but also as a means to elicit community commentary on effective practices when making the many decisions during the workflow (e.g., how to handle punctuation, count words, and select metrics). 
<em>In the Margins</em> can be explored directly from its <a href="http://scalar.usc.edu/works/lexos/index">Scalar website</a>, but we also make use of Scalar&#39;s Application Programming Interface (API) to embed <em>In the Margins</em> content directly in Lexos. We think it is important that Lexos not become a &quot;black box&quot; into which users feed their texts and from which they obtain results uncritically. By making the discussion part of the tool and the tool part of the discussion, we aim to make Lexos a more rigorous and powerful tool, one with which we can explore more generally the growing impact that quantitative and algorithmic approaches are having on the Humanities.</p></sioc:content>
<scalar:defaultView>book_splash</scalar:defaultView>
<scalar:continue_to_content_id>173671</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-07-07T16:24:44+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1257313"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/index"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1257313:1257324:1">
<scalar:urn rdf:resource="urn:scalar:path:1257313:1257324:1"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/index.62"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/learn-more.11#index=1"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/learn-more">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-16T16:06:26+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:173700"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/learn-more.11"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/learn-more.11"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/learn-more.11">
<ov:versionnumber>11</ov:versionnumber>
<dcterms:title>Learn More about In the Margins</dcterms:title>
<sioc:content><p><i>In the Margins</i> is the Lexomics Research Group&rsquo;s attempt to position the process of computational literary text analysis side by side with its product, whether it be the tool used for or the results obtained from such analysis. This is particularly important for entry-level users and those whose training has not explored the issues raised by computational methods of studying literature. Our text analysis tool, Lexos, is designed for use by newcomers to the field while empowering them to do sophisticated work in relatively little time. But with power comes a price&mdash;it must be employed critically. Too often text analysis tools elide aspects of the text analysis process, drawing attention away from the many steps and decisions required both before and after the use of the tool, all of which can impact the results. Documentation tends to focus on how to use the software, rather than how or why it would be used in specific circumstances. Discussion of this sort may exist in other forums, but the separation between the discussion and the tool tends to make the latter function as a &ldquo;black box&rdquo;. This can ultimately feed tensions between theoretical traditions prevalent in the Humanities and the use of quantitative methods that often have their origins in other disciplines. <i>In the Margins</i> answers Johanna Drucker&rsquo;s call for Digital Humanities to &ldquo;synthesize method and theory into ways of doing as thinking&rdquo; by designing tools that embody humanists&rsquo; value of &ldquo;debate, commentary, and interpretive exposition&rdquo; (2012).</p><p>A central feature of our approach is the creation of a seamless transition between the tool, the documentation, and the discussion. <i>In the Margins</i> contains both instructions for how to use Lexos and discussion about why particular steps or decisions might be taken in the analytical process. This content is then embedded within the Lexos user interface so that the user is always aware of the need for reflection about the process. Although <i>In the Margins</i> can be explored directly in the Scalar publishing platform, much of its content is also accessible from within Lexos, <i>in situ</i>, so that the user is more easily able to find information about the implications and best practices for any given function and to reflect upon these issues as part of his or her process. <i>In the Margins</i> embraces the design challenge of providing text, expert commentary, and screen-demos from within the Lexos workflow in order to offer commentary as close to the user&rsquo;s current task as possible. This commentary comes from the Lexomics Research Group and an array of outside experts, and we hope that the content will grow over time. If you are interested in providing content for <i>In the Margins</i>, please contact us.</p><p>The use of the Scalar publishing platform allows us to make <i>In the Margins</i> content available both within Lexos (using Scalar&rsquo;s API) and separately on the web for use as a resource by those who may be using other tools or approaches. Scalar organizes content into &ldquo;paths&rdquo;, which are like chapters of a book, except that individual pages can appear in multiple paths and paths can fork into other paths. Scalar provides methods of visualizing this structure to allow users to navigate the paths. In addition to the current path, <i>In the Margins</i> provides a path about Lexomics and a path about Lexos. 
Most pages can be accessed from one of these paths, but a few, mostly those focusing specifically on providing instructions for using the Lexos tool, are accessible only from within Lexos.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-07-07T16:31:05+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1257324"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/learn-more"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/interface">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-15T14:58:12+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:314525"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/interface.2"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/interface.2"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/interface.2">
<ov:versionnumber>2</ov:versionnumber>
<dcterms:title>The Lexos Interface</dcterms:title>
<dcterms:description>Manual page for the Lexos Interface</dcterms:description>
<sioc:content>The Lexos interface is designed to be simple to use, to emphasize the Lexomics workflow, and to make the many decisions required in performing computational text analysis as transparent as possible. As of version 3.0, it consists of 14 tools, all of which are accessible from the navigation menu at the top of the screen. The banner identifies which tool you are in through the use of curly braces, e.g. &quot;Lexos{Scrubber}&quot;. Each tool is part of a component of the workflow, and the current component is highlighted in light blue in the menu.<br /><br />When you start Lexos in your web browser, a session folder is created to contain all your files and settings. This is known as the Lexos <strong>workspace</strong>. You may save your workspace at any time by clicking the <strong>Workspace</strong> button at the top of the banner, which downloads your workspace as a file. Uploading this file in the <strong>Upload</strong> tool will restore all your files and settings from their state when you downloaded the workspace.<br /><br />Note: If you are using the online version of Lexos at <a target="_blank" href="http://lexos.wheatoncollege.edu/">http://lexos.wheatoncollege.edu/</a>, your session folder may be stored on the server for up to a month. If you leave and return to Lexos, you may find that your last workspace pops up automatically. But we don&#39;t recommend that you rely on this.<br /><br />The <strong>Reset</strong> button will destroy your current session, start a new one, and redirect you to the <strong>Upload</strong> tool. If you ever encounter an error, you may find that the functionality of Lexos can be restored by clicking the <strong>Reset</strong> button or by replacing <code>/upload</code>, <code>/manage</code>, or whatever tool is at the end of the URL in the browser with <code>/reset</code>.<br /><br />The <strong>Gear</strong> button in the top right corner of the interface opens a dialog with a message about Lexos. You can also click the <strong>Use Beta functions</strong> checkbox to enable Lexos&#39; Beta functions. These are new tools that are not yet fully tested. By default, they are hidden, but they will become visible if you select this option. Use Beta functions with caution, as they are not yet considered stable.<br /><br />Beneath the Lexos banner is the menu bar, which is organized to emphasize the <a href="http://scalar.usc.edu/works/lexos/lexos">Lexomics workflow</a>. On the right side of the banner, Lexos displays a folder icon if you have active documents. Mousing over the icon will display a tooltip showing the number of active documents. Clicking on it will open the Lexos <strong>Manage</strong> tool.<h3>The <em>In the Margins</em> Panel</h3>The <em>In the Margins</em> Panel can be accessed from all tools in Lexos by clicking the small tab on the left edge of the screen. Clicking the tab again will close the panel. The <em>In the Margins</em> Panel contains the text of the Lexos Manual page for the tool currently in use. Click on the title link to open the page in a new window. This will give you access to the entire <em>In the Margins</em> website.<h3>Feedback and Support</h3>If you have questions or suggestions, click the <strong>Feedback and Support</strong> link at the bottom of the screen. 
We also welcome bug reports on our <a target="_blank" href="https://github.com/WheatonCS/Lexos/issues">GitHub site</a>.<h3>Language and Terminology</h3>Lexos has been designed using the insights of many different disciplines, which often use different language for the same or similar concepts. In choosing terminology to label functions in the interface, we have attempted to walk a tightrope between familiar language, jargon, and language that might be inaccurate for some users. Perhaps the most noticeable example is the use of &quot;word&quot;&mdash;a very slippery concept indeed. Computational approaches to textual analysis can only work with countable units, and it is not always easy to identify what constitutes a &quot;word&quot;. In Western written languages, words are often designated by delimiters such as spaces and punctuation marks, but this does not apply to all languages. In order to be as neutral as possible, we adopt usage common in computational linguistics and machine learning. We refer to countable units as &quot;tokens&quot; and their unique forms as &quot;terms&quot;. This usage may at first feel unfamiliar to many humanities students and scholars, but we believe that it is preferable to avoid the problematic use of &quot;word&quot;. On the other hand, for some tools and concepts, such as &quot;word clouds&quot;, where &quot;word&quot; is well-established or otherwise useful, we have retained it. In this case, it should be taken to be synonymous with &quot;term&quot;.<br /><br />Another usage we adopt from machine learning is the generic term &quot;document&quot; to refer to any type of text. In many disciplines, &quot;documents&quot; refers to particular types of &quot;non-literary&quot; text such as laws, treatises, invoices, and other types of records designed primarily without an aesthetic purpose in mind. Such a distinction is arguably an intellectual construct, but from a computational point of view there is no difference between a law and a lyric. Both consist of lists, or vectors, of countable tokens. Furthermore, if you cut them into smaller segments, you are left&mdash;again, from a computational point of view&mdash;with smaller vectors just like the originals. Hence it is appropriate to use the same term, &quot;document&quot;, for both the whole text and segments of the text. In practice, this means that we adopt a variety of terms. On your computer, your texts are stored in &quot;files&quot;. When you upload them to Lexos, they become &quot;documents&quot; in the Lexos workspace. You may use Lexos to manipulate any documents in the workspace, whether they consist of whole texts or segments derived from them. We sometimes use &quot;text&quot; when we need a term that refers to the object of study and &quot;segments&quot; when we are referring specifically to slices of larger documents.<br /><br />If you ever get stuck with the terminology employed in Lexos, <em>In the Margins</em> has a full <a href="glossary">Glossary</a>.</sioc:content>
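A minimal Python sketch of the token/term distinction described above (the example text and the naive whitespace tokenization are our own illustrative assumptions; Lexos's tokenizer is more configurable):

```python
from collections import Counter

# "Tokens" are the countable units in a document; "terms" are their unique forms.
text = "the king saw the queen"

tokens = text.split()    # naive whitespace tokenization (an assumption of this sketch)
terms = Counter(tokens)  # each unique token form, with its count

print(len(tokens))  # 5 tokens
print(terms)        # Counter({'the': 2, 'king': 1, 'saw': 1, 'queen': 1})
```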
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-24T09:49:41+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:839906"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/interface"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/kmeans">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-15T17:42:48+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:314549"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/kmeans.4"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/kmeans.4"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/kmeans.4">
<ov:versionnumber>4</ov:versionnumber>
<dcterms:title>The K-Means Clustering Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos K-Means Clustering tool</dcterms:description>
<sioc:content><p>The Lexos <strong>K-Means Clustering</strong> tool partitions your active documents into flat clusters in a way that minimizes the variation within the clusters. It produces a scatterplot graph in which you can visualize the distance between documents or clusters. The &quot;K&quot; in &quot;K-Means&quot; refers to the number of partitions. For instance, if you wish to cluster your documents into three groups, you would set <code>K=3</code>. The default is the number of active documents, but you will probably want to set this to a smaller number. There is no obvious way to choose the number of clusters. It can be helpful to perform hierarchical clustering before performing K-Means clustering, as the resulting dendrogram may suggest a certain number of clusters that is likely to produce meaningful results. The K-means procedure is very sensitive to the position of the initial seeds, although employing the <strong>K-means++</strong> setting can help to constrain this placement.</p><p>Lexos provides two methods of visualizing K-means cluster analyses. The default, <strong>Voronoi Cells</strong>, identifies a centroid (central point) in each cluster and draws a polygon (the Voronoi cell) around it. This is helpful in allowing you to see which points fall into which cluster. Select <strong>PCA</strong> in the <strong>Method of Visualization</strong> dropdown to view the graph as a <em><a target="_blank" href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a></em>, where dots on the plane are colored to mark their cluster membership. Both visualization approaches can help you judge distances between clusters.</p><h3>Generating and Reading a K-Means Cluster Analysis</h3><p>Simply click the <strong>Get K-Means</strong> button to perform a K-means cluster analysis. If you wish, you can modify the default settings using the <strong>Advanced K-Means Options</strong> and <strong>Silhouette Score Options</strong> menus described below.</p><p>K-Means cluster analyses can contain a lot of points that are very close together, making the graph difficult to read. In order to aid the process, Lexos provides a table to the left of the graph which displays your documents and color-codes them to indicate which cluster they belong to. The same colors are used in the graph. In the Voronoi cell graph, you can move your mouse cursor over the document in the table or a point on the graph to reveal a tooltip label showing the document&#39;s name.</p><h4>Advanced K-Means Options</h4><p>Since cluster membership is adjusted at each stage of the process by the re-location of the centroids, the number of iterations required and other factors can be adjusted to select a cutoff point for the algorithm or a desired threshold for convergence of different clusters. These adjustments are handled by the <strong>Advanced K-Means Options</strong>: <strong>Maximum Number of Iterations</strong>, <strong>Method of Initialization</strong>, <strong>Number of Iterations with Different Centroids</strong>, and <strong>Relative Tolerance</strong>. As with the initial choice of cluster numbers, there are no hard and fast rules for how these factors should be applied. The default settings should serve most users&#39; purposes. However, here are some brief descriptions of the purposes of each option:</p><p><u>Maximum number of iterations:</u> The K-means algorithm will continue to re-compute centroids for each cluster until all documents settle down into &quot;final&quot; clusters. 
A situation can occur in which a document continues to toggle back and forth between two clusters. Setting this value avoids an endless, or at least unnecessarily long, repetition of the algorithm with little change in the result.</p><p><u>Method of Initialization:</u> The results of using K-means on a collection of documents can vary significantly depending on the initial choice of centroids. In Lexos, the user is offered two choices: <strong>K-Means++</strong> and <strong>Random</strong>. When using the default <strong>K-Means++</strong> setting, Lexos chooses the first of the K clusters at random (typically by picking any one of the documents in the starting set as representative of a center of a future cluster). The remaining (K-1) cluster centers are then chosen from the remaining documents by computing a probability proportional to their distances from the centers already chosen. Once all centroids are chosen, normal K-Means clustering takes place. The <strong>Random</strong> setting employs a &quot;random seed&quot; approach in which the locations of <em>all</em> centroids at the initial stage are generated randomly. It is best to experiment multiple times with different random seeds.</p><p><u>Number of Iterations with Different Centroids:</u> Documentation of this feature is not yet available.</p><p><u>Relative Tolerance:</u> Documentation of this feature is not yet available.</p><h4>Silhouette Score Options</h4><p>Silhouette scores give a general indication of how well individual objects lie within their cluster and are thus one method of <a href="establishing-robust-clusters">measuring cluster robustness</a>. A score of 1 indicates tight, distinct clusters. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.</p><p>To generate a silhouette score for your cluster analysis, click on the <strong>Silhouette Score Options</strong> menu. The only option is to change the <strong>Distance Metric</strong> used for measuring the distance between points. For further information, see <a href="choosing-a-distance-metric">Choosing a Distance Metric</a>. Once you have selected a distance metric, click the <strong>Get K-Means</strong> button and the silhouette score will appear below the button when the process is complete.</p><h3>Downloading K-Means Graphs</h3><p>There is currently no method for downloading Voronoi graphs, and we recommend taking screenshots. For PCA graphs, you can right-click and use your browser&#39;s <strong>Save image as...</strong> function. We recommend clicking the <strong>Enlarge Graph</strong> button to open the image in a new window.</p></sioc:content>
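The options described above map naturally onto the parameters of scikit-learn's KMeans estimator. The following is a minimal sketch of a comparable analysis run outside Lexos, not a description of Lexos's internal code; the toy corpus and parameter values are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus (an assumption of this sketch, not Lexos data).
docs = [
    "the king rode to the hall",
    "the cyning held the hall",
    "ships sailed over the cold sea",
    "the sea bore the ships away",
]

# Document-term matrix: rows = documents, columns = terms.
X = CountVectorizer().fit_transform(docs)

# K=2 clusters. The keyword arguments parallel the options above:
# init = Method of Initialization, max_iter = Maximum Number of Iterations,
# n_init = runs with different starting centroids, tol = Relative Tolerance.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, max_iter=300, tol=1e-4)
labels = km.fit_predict(X)

# Silhouette score: near 1 = tight, distinct clusters; near 0 = overlapping.
print(labels, silhouette_score(X, labels, metric="euclidean"))
```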
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-22T16:38:02+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:839366"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/kmeans"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/lemmas">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-02T08:30:19+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:158933"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/lemmas.2"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/lemmas.2"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/lemmas.2">
<ov:versionnumber>2</ov:versionnumber>
<dcterms:title>Lemmas</dcterms:title>
<sioc:content>The Lemmas option allows you to replace different words throughout the selection with a single new word. This is most often used to disambiguate varied spellings of a given word, such as in the case of kyng, cyng, and king. Using the Lemmas option, you could simply input a list in the form 'kyng, cyng: king' to replace every 'kyng' and 'cyng' in the text with 'king'. Hopefully a bloodless coup.</sioc:content>
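To make the replacement behavior concrete, here is a minimal Python sketch of this kind of lemma substitution, assuming simple whole-word matching; the function name is hypothetical, and Lexos's actual implementation may differ in details such as case handling:

```python
import re

def apply_lemmas(text, rules):
    """Apply replacement rules of the form 'variant1, variant2: lemma'."""
    for line in rules.strip().splitlines():
        variants, lemma = line.split(":")
        for variant in (v.strip() for v in variants.split(",")):
            # \b restricts matches to whole words (an assumption of this sketch)
            text = re.sub(r"\b" + re.escape(variant) + r"\b", lemma.strip(), text)
    return text

print(apply_lemmas("the kyng and the cyng", "kyng, cyng: king"))
# -> "the king and the king"
```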
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-02T09:06:32+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:394032"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/lemmas"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/lexomics">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-16T00:11:22+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:173671"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/lexomics.2"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/lexomics.2"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/lexomics.2">
<ov:versionnumber>2</ov:versionnumber>
<dcterms:title>Lexomics</dcterms:title>
<dcterms:description>The starting point for the Lexomics path</dcterms:description>
<sioc:content><span>The term &ldquo;lexomics&rdquo; was originally used to describe the computer-assisted detection of &ldquo;words&rdquo; (short sequences of bases) in genomes,<sup><a href="http://www.jstor.org/stable/10.1086/668252#fn15">*</a></sup> but we have extended it to apply to literature, where lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. Using statistical methods and computer-based tools to analyze data retrieved from electronic corpora, lexomic analysis allows us to identify patterns of vocabulary use that are too subtle or diffuse to be perceived easily. We then use the results derived from statistical and computer-based analysis to augment traditional literary approaches including close reading, philological analysis, and source study. Lexomics thus combines information processing and analysis with methods developed by medievalists over the past two centuries. We can use traditional methods to identify problems that can be addressed in new ways by lexomics, and we also use the results of lexomic analysis to help us zero in on textual relationships or portions of texts that might not previously have received much attention.<br /><br />More information can be found on the <a target="_blank" href="http://lexomics.wheatoncollege.edu/">Lexomics</a> website.</span></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<scalar:continue_to_content_id>314553</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-16T20:22:36+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:834847"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/lexomics"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/lexos">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2015-08-16T00:13:08+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:173674"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/lexos.6"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/lexos.6"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/lexos.6">
<ov:versionnumber>6</ov:versionnumber>
<dcterms:title>The Lexos workflow</dcterms:title>
<dcterms:description>The main starting page for the Lexos software path</dcterms:description>
<sioc:content>So, you&#39;ve got a group of texts and you want to explore them in new (computational) ways. But, where to start? What to do first? There are many decisions to make as you apply computational methods to your digital files.<br /><br /><strong>Upload --&gt; Scrub --&gt; Segment --&gt; Count --&gt; Cull --&gt; Analyze --&gt; Visualize </strong><br /><em>(follow a path or jump around, repeat as needed)</em><br /><br />The <em>Lexos</em> workflow provides a user experience that calls attention to the series of decisions you must make when working with digital texts. Together, a series of decisions in a workflow represents your experiment&#39;s methodology, essentially the Methods section in a publication. In addition to providing entry points for discussions of the workflow (e.g., sharing effective practices when making choices), it has not escaped our notice that explicitly addressing the many steps in the process strengthens the dissemination of results and contributes to the repeatability of experiments.</sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3689"/>
<dcterms:created>2016-06-02T14:21:22+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:777328"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/lexos"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:777328:839675:1">
<scalar:urn rdf:resource="urn:scalar:path:777328:839675:1"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/lexos.6"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/the-lexomics-workflow.30#index=1"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/the-lexomics-workflow">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-02T07:17:11+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:158915"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/the-lexomics-workflow.30"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/the-lexomics-workflow.30"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/the-lexomics-workflow.30">
<ov:versionnumber>30</ov:versionnumber>
<dcterms:title>The Lexomics Workflow</dcterms:title>
<sioc:content><p>We call Lexos &quot;An Integrated Lexomics Workflow&quot; because it brings together many of the processing steps we in the Lexomics project regularly perform in our research. A little history of the Lexomics project may give some useful perspective on what we mean by a workflow. When the Lexomics project began, it consisted of three simple Perl scripts: one to clean up texts, one to cut them, and one to perform cluster analysis on them. Each script had to be run in sequence. So, after a while, it made sense to create a single tool that would guide the user from one to the next. It then became clear that the tool&#39;s interface could allow the user to go back to earlier steps, tweak the settings, and then repeat their experiments. There were in fact many ways in which a user could design experiments using a single tool, and the tool could help the user manage their activities and, perhaps more importantly, think critically about their process. Thus was Lexos born.<br /><br />While the strictly linear steps of its origins are no longer the only possible approaches you can adopt when using Lexos, they provided an important insight into how computational text analysis workflows are constructed. They essentially have three basic steps: <strong>pre-processing (scrubbing)</strong>, <strong>analysis</strong>, and <strong>visualization</strong>. It is not always possible to clearly separate these activities. Even in our earliest scripts, the first two were pre-processing steps and the last, which plotted a tree diagram of the cluster analysis, combined analysis and visualization. But, as Lexos has developed, we have tried to make this its organizing principle, encouraging the user to proceed from text preparation to simple visualization of their data to more complex analysis. This is particularly useful for entry-level users and those whose training has not explored the issues raised by computational methods. (<em>In the Margins</em> is our attempt to position the process of computational text analysis side by side with its product.) Lexos is thus designed to enable newcomers to the field to adopt the Lexomics workflow, empowering them to do sophisticated work in relatively little time.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<scalar:continue_to_content_id>158915</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-23T15:20:33+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:839675"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/the-lexomics-workflow"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/manage">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-12T12:56:17+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:314084"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/manage.2"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/manage.2"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/manage.2">
<ov:versionnumber>2</ov:versionnumber>
<dcterms:title>The Manage Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos Manage tool</dcterms:description>
<sioc:content><p><b>Manage</b> is the tool you use to perform various types of &quot;housekeeping&quot; on documents in your Lexos workspace. In addition to documents derived from files you have uploaded, <b>Manage</b> will also list documents created by other tools, such as segments produced by the Cutter tool.</p><p>Use the <b>Manage</b> tool for the following purposes:</p><ul><li>To activate and de-activate documents in your workspace. By default, most Lexos tools will only operate on your active documents.</li><li>To delete unwanted documents from your workspace.</li><li>To re-name or classify documents in your workspace.</li></ul><h4>The Manage Interface</h4><p>Documents in your workspace are listed in the form of a table. The uploaded file from which each document is derived is listed by filename in the <b>Original Source</b> column. The <b>Document Name</b> column lists the filename without the extension. If you use Lexos tools to create new documents based on your uploaded files, the original filename will be displayed in the <b>Original Source</b> column, and a new name will be generated for the <b>Document Name</b> column. Document names can be changed as described in <b>Using the Context Menu</b> below.</p><p>By default, documents created by file upload or a Lexos tool do not have an associated class, so the <b>Class Label</b> column is empty. The <b>Excerpt</b> column shows the beginning and end of each document separated by an ellipsis (...). Columns can be sorted alphabetically by clicking on the column header. The table highlights columns in blue to show which column is being used to sort the listed documents. If you have a large table, you can filter it down to a few rows containing keywords entered in the <b>Search</b> field. The text of the entire table is searched, so matches may be found in any column. You may use the <b>Display</b> dropdown menu to increase the number of rows displayed, or you can use the pagination links at the bottom right of the table to page through smaller sets of rows.</p><h4>Activating, De-Activating, and Deleting Documents</h4><p>By default, all documents are activated when they are uploaded. Rows containing active documents are highlighted in green. The following methods can be used to manage the active state of documents:</p><ul><li><b>Single Click</b>: This will de-activate all documents and toggle the state of the row clicked. If it is active, it will be de-activated. If it is not active, it will be activated.</li><li><b>Control or Command Click</b>: This will toggle the state of the row clicked without affecting the state of any other rows.</li><li><b>Shift Click</b>: This activates ranges of rows. Shift-clicking on a row will activate documents in all rows between the row clicked and the first active row above or below the row clicked.</li><li><b>Drag Click</b>: Clicking on a row with the mouse button held down will activate or de-activate all rows between the row clicked and the row the mouse cursor is over when the mouse button is released.</li><li><b>Right Click</b>: This will open the context menu. 
See <b>Using the Context Menu</b> below.</li><li><b>The Select All and Deselect All Buttons</b>: These are useful because they activate and de-activate all the documents in your workspace, not just those displayed on the page.</li></ul><p>Documents may also be activated and de-activated using the <b>Context Menu</b> as described below.</p><p>Certain tools such as <b>Word Cloud</b> allow you to select and de-select sub-sets of your active documents. These selections apply only within the given tool and do not affect whether the documents are active or not throughout the Lexos suite. If you need to change the state of a document so that it is or is not accessible to all tools, you should do this using <b>Manage</b>.</p><h4>Deleting Documents</h4><p>Deleting individual documents from the workspace is probably most easily achieved using the <b>Context Menu</b> as described below. However, you can deselect all documents, activate only the document you wish to delete, and then click the <b>Delete Selected</b> button. This button is probably more useful when you have multiple active documents, as it will delete them all at once. Make sure that you have de-activated any documents you do not wish to delete.</p><h4>Using the Context Menu</h4><p>Right-clicking on a table cell or row will open the context menu. It has the following options:</p><ul><li><b>Preview Document</b>: This will open a dialog containing the entire text of your document (without formatting or white spaces). Note that longer documents can take a while to load, so please be patient.</li><li><b>Edit Document Name</b>: This function allows you to create a new name for the document in the row you have clicked. To change the name, enter your new name in the dialog form field and click <b>Save</b>.</li><li><b>Edit Document Class</b>: This function allows you to create a class label for the document in the row you have clicked. Enter the label you wish to identify with the class in the dialog form field and click <b>Save</b>. See further the section on document classes below.</li><li><b>Delete Document</b>: This function will delete the individual document in the row you have clicked.</li><li><b>Select All Documents and Deselect All Documents</b>: These options have the same function as the <b>Select All</b> and <b>Deselect All</b> buttons.</li><li><b>Apply Class to Selected Documents</b>: If you have multiple active documents, this option will allow you to apply a class label to all of them at once. Enter the label you wish to identify with the class in the dialog form field and click <b>Save</b>. See further the section on document classes below.</li><li><b>Delete Selected Documents</b>: If you have multiple active documents, this option will allow you to delete them all at once. It has the same function as the <b>Delete Selected</b> button.</li></ul><h4>Classifying Documents</h4><p>Document classes are groups of documents identified as belonging to the same category defined by some human-assigned criterion. For instance, a collection of novels might be separated into two classes based on whether they were published in Britain or the United States. Gender, genre, and date of authorship might also be used to classify documents. Lexos&#39; class labels allow you to assign classes to documents and sort by class in the <b>Manage</b> tool. At present, document classes are under-utilized elsewhere in the Lexos suite, but they are an important part of the <b>Topwords</b> tool. 
In general, you should assign class labels in <b>Manage</b> before going to <b>Topwords</b>.</p></sioc:content>
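Conceptually, class labels amount to a mapping from documents to human-assigned categories that downstream tools can group by. A minimal Python sketch of that data structure, using the Britain/United States example above (the filenames and labels are purely hypothetical):

```python
from collections import defaultdict

# Hypothetical class labels assigned in Manage.
classes = {
    "MobyDick.txt": "American",
    "Emma.txt": "British",
    "Persuasion.txt": "British",
}

# Group documents by class, as a tool like Topwords does conceptually
# when comparing one class against another.
by_class = defaultdict(list)
for doc, label in classes.items():
    by_class[label].append(doc)

print(dict(by_class))
# {'American': ['MobyDick.txt'], 'British': ['Emma.txt', 'Persuasion.txt']}
```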
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-12T13:50:28+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:831436"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/manage"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/manual">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-15T18:02:04+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:314553"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/manual.10">
<ov:versionnumber>10</ov:versionnumber>
<dcterms:title>Manual</dcterms:title>
<dcterms:description>Start page for the Lexos Manual</dcterms:description>
<sioc:content><h3>Introduction</h3><p>The Lexos Manual is the &quot;how to&quot; guide for the Lexos suite. Each tool is documented with instructions for how to use the various configurations in the interface. The manual attempts to present a straightforward account of how to use Lexos, but it also hints at the wider intellectual issues raised by using a tool like Lexos. In such cases, the Manual often links to more in-depth discussions in the <strong>Topics</strong> section of <em>In the Margins</em>.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<scalar:continue_to_content_id>173669</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3689"/>
<dcterms:created>2018-05-23T20:50:16+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1817392"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/manual"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:839689:1">
<scalar:urn rdf:resource="urn:scalar:path:1817392:839689:1"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/how-to-run-lexos.2#index=1"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:839675:2">
<scalar:urn rdf:resource="urn:scalar:path:1817392:839675:2"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/the-lexomics-workflow.30#index=2"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:839906:3">
<scalar:urn rdf:resource="urn:scalar:path:1817392:839906:3"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/interface.2#index=3"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:834849:4">
<scalar:urn rdf:resource="urn:scalar:path:1817392:834849:4"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/upload-tool.3#index=4"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/upload-tool">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-12T12:31:31+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:314077"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/upload-tool.3"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/upload-tool.3"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/upload-tool.3">
<ov:versionnumber>3</ov:versionnumber>
<dcterms:title>The Upload Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos Upload tool</dcterms:description>
<sioc:content><p><b>Upload</b> is the standard starting point for the Lexos workflow. When you begin a new session or reset your workspace, you will be automatically re-directed to <b>Upload</b>.</p><p>Use of the tool is fairly straightforward. Drag your document files into the box labeled <b>drop files here</b>, or click the <b>Browse</b> button to use your web browser&#39;s file browser to locate your files. Most browsers will allow you to shift- or control-click to select multiple files.</p><p>There are some restrictions on file upload size in order to prevent the browser from hanging. Nevertheless, upload times may be slow for large files, particularly if you are working over the internet. The maximum file size of 250MB is approximately the size of nine Webster&#39;s Unabridged Dictionaries. If you experience a problem, try uploading smaller files, or, if you are uploading many files, try uploading them in smaller batches.</p><p>Lexos accepts files in <code>.txt</code>, <code>.html</code>, <code>.xml</code>, and <code>.sgml</code> formats. Make sure that your filenames contain these extensions.</p><p>Once you have selected your files, they will begin to upload, one at a time. As each upload is complete, you will see a notification at the bottom of the screen shortly after the <b>Ready For Files To Upload</b> progress bar has said &quot;Complete!&quot; The bigger the file, the longer it will take to upload and show up on the page. After uploading is complete, each file is considered a document by Lexos. You can activate, de-activate, re-label, and classify your documents using the Manage tool.</p><p><b>Note on character encoding</b>: Lexos will automatically convert all files to <a target="_blank" href="https://en.wikipedia.org/wiki/UTF-8">UTF-8 character encoding</a>. If you are uploading HTML, XML, or SGML files that contain special characters, the Scrubber tool will help you to convert them to UTF-8 characters.</p><h4>The Lexos Beta Web Scraper</h4><p>At present, your documents must be available as files on your computer. However, Lexos has a Beta web scraper tool, which will allow you to download files from the internet. This is especially useful when you are using files from sources such as <a href="https://www.gutenberg.org/">Project Gutenberg</a>. To enable the web scraper, click the &quot;Gear&quot; icon in the top right corner of the screen and select the <b>Use Beta functions</b> checkbox. A link to the web scraper tool will appear above the <b>Browse</b> button. Wherever possible, use it to download plain text files since, otherwise, you will download all the HTML markup in a web page (this can be removed using the Scrubber tool). Upload times may vary, depending on internet speeds. If the process seems to hang, try uploading fewer URLs. Large-scale web scraping should not be done in Lexos.</p></sioc:content>
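The character-encoding note above can be illustrated with a short sketch. This is not Lexos's actual conversion routine; it assumes the third-party chardet library for encoding detection, and the function name is our own:

```python
import chardet  # third-party encoding detector (an assumption of this sketch)

def read_as_utf8(path):
    """Guess a file's encoding and return its text, ready to re-save as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
    return raw.decode(guess["encoding"] or "utf-8", errors="replace")

# usage: text = read_as_utf8("beowulf.txt")
# then:  open("out.txt", "w", encoding="utf-8").write(text)
```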
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-16T20:30:06+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:834849"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/upload-tool"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:831436:5">
<scalar:urn rdf:resource="urn:scalar:path:1817392:831436:5"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/manage.2#index=5"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:1052905:6">
<scalar:urn rdf:resource="urn:scalar:path:1817392:1052905:6"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/scrubber.46#index=6"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/scrubber">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-02T07:37:46+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:158923"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/scrubber.46"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/scrubber.46"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/scrubber.46">
<ov:versionnumber>46</ov:versionnumber>
<dcterms:title>The Scrubber Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos Scrubber tool</dcterms:description>
<sioc:content><p>Preprocessing your texts, what we refer to as &quot;scrubbing&quot;, is a critical step in the Lexos workflow. In order to facilitate a conscious consideration of the many small decisions required, scrubbing options are isolated into individual choices. If for no other reason, your careful deliberation over and choice among the many options facilitates replication of your analyses in the future, both by you and by others who wish to verify your experiment.</p><p>The Scrubber tool interface allows you to select and combine options on the left side of the screen. Click the <strong>Preview Scrubbing</strong> button to see the results in the preview windows below. At this point, only the beginning and ending of each document is displayed, separated by an ellipsis (&hellip;). When you are satisfied that you have achieved the desired effect, click the <strong>Apply Scrubbing</strong> button. Your documents will be scrubbed, and the scrubbed versions will be used by all the other Lexos tools.</p><p>Scrubbing affects all active documents and cannot be undone. So make sure to de-activate any documents you do not wish to scrub using the <strong>Manage</strong> tool. If you apply scrubbing and later wish to revert to the unscrubbed version, you will have to upload another copy to Lexos.</p><p>Scrubbing is an algorithm: a series of steps applied in a specific order. If you wish to change that order, you will need to de-select some options, scrub, re-select them, and then scrub again. The order of operations is provided in <strong>The Lexos Scrubber Algorithm</strong> section below.</p><h3>Scrubbing Options</h3><ol><li><strong>Remove <a href="https://www.gutenberg.org">Project Gutenberg</a> boilerplate material</strong>: Upon entering the Scrubber page, if you have uploaded a file from the Project Gutenberg website without removing the boilerplate material (i.e., text added by the Project Gutenberg site at the top and license material at the end of the text), you will receive the following warning:<blockquote><p>One or more files you uploaded contain Project Gutenberg licensure material. You should remove the beginning and ending material, save, and re-upload the edited version. If you Apply Scrubbing with a text with Gutenberg boilerplate, Lexos will attempt to remove the majority of the Project Gutenberg Licensure, however there may still be some unwanted material left over.</p></blockquote><p>Note that if you select the &lsquo;Apply Scrubbing&rsquo; button without removing this extra text, Lexos will attempt to remove the Project Gutenberg boilerplate material at the top and end of the file. However, since Project Gutenberg texts do not have a consistent boilerplate format, we suggest you remove the boilerplate material using a text editor before uploading it to Lexos in order to prevent unwanted text from being included in subsequent analyses, e.g., including Project Gutenberg licensure material in your word counts. If you choose to let Lexos do the work for you, we recommend that you use the <a href="manage">Manage</a> tool to preview the beginning and ending of the document after you have scrubbed, in order to ensure that Lexos has not left any boilerplate behind or deleted any of your text. Lexos&rsquo; attempt to remove beginning and ending boilerplate material applies only to files from the Project Gutenberg website. When choosing a file from this website, we recommend the &ldquo;Plain Text UTF-8&rdquo; version. 
It is smaller, so it will upload faster, and you will not have to remove any HTML markup.</p></li><li><p><strong>Remove All Punctuation</strong>: Lexos assumes that uploaded files may be in any language and automatically converts them to <a target="_blank" href="https://en.wikipedia.org/wiki/Unicode">Unicode</a> using <a target="_blank" href="https://en.wikipedia.org/wiki/UTF-8">UTF-8 character encoding</a>. This requires that Lexos recognize punctuation marks from a wide variety of languages. All Unicode characters have an associated set of metadata for classifying their &ldquo;type&rdquo;, e.g. as a letter, punctuation mark, or symbol. If the <strong>Remove All Punctuation</strong> option is selected, any Unicode character in each of the active texts with a &ldquo;Punctuation Character Property&rdquo; (that character&rsquo;s property begins with a &lsquo;P&rsquo;) or a &ldquo;Symbol Character Property&rdquo; (begins with an &lsquo;S&rsquo;) is removed. A guide to Unicode Character Categories can be found on <a target="_blank" href="http://www.fileformat.info/info/unicode/category/index.htm">fileformat.info</a>.</p><p>If <strong>Remove All Punctuation</strong> is selected, three additional sub-options are available:</p><ul><li><strong>Keep Hyphens</strong>: Selecting this option will change all variations of Unicode hyphens to a single type of hyphen (&quot;-&quot;), which will be left in the text. Hyphenated words (e.g., &ldquo;computer-aided&rdquo;) will subsequently be treated as a single token. Further discussion of the limitations can be found [here](link to scrubbing-topic/keep-hyphen).</li><li><strong>Keep Word-Internal Apostrophes</strong>: If this option is selected, apostrophes will be retained in contractions (e.g., <em>can&rsquo;t</em>) and possessives (<em>Scott&rsquo;s</em>), but not those in plural possessives (<em>students&rsquo;</em> becomes the term&nbsp;<em>students</em>) nor those that appear at the start of a token (<em>&#39;bout</em> becomes the term&nbsp;<em>bout</em>). Further discussion of the limitations can be found [here](link to scrubbing-topic/keep-word-internal-apostrophes).</li><li><strong>Keep Ampersands</strong>: This option will not treat ampersands as punctuation marks and will retain them in the text. Note that HTML, XML, and SGML entities such as <code>&amp;aelig; </code> (<em>&aelig;</em>) are handled separately and prior to the <strong>Keep Ampersands</strong> option. You can choose how to convert these entities to standard Unicode characters using the <strong>Special Characters</strong> option.</li></ul></li><li><strong>Make Lowercase</strong>: Converts all uppercase characters to lowercase characters so that the tokens <em>The</em> and <em>the</em> will be considered the same term. In addition, all contents (whether in uploaded files or entered manually) for the <strong>Stop Words/Keep Words</strong>, <strong>Lemmas</strong>, <strong>Consolidations</strong>, or <strong>Special Characters</strong> options will also have all uppercase characters changed to lowercase. Lowercase is not applied inside any HTML, XML, or SGML markup tags remaining in the text.</li><li><strong>Remove Digits</strong>: Removes all number characters from the text. Similar to the handling of punctuation marks, any Unicode character in each of the active texts with a &ldquo;Number Character Property&rdquo; is removed. For example, this option will remove a Chinese three (㈢) and an Eastern Arabic six (۶) from the text. Note: at present, Lexos does not match real numbers as a unit. 
For example, for <em>3.14</em>, Lexos will remove only the 3, 1, and 4; the decimal point will be removed only if the <strong>Remove All Punctuation</strong> option is selected. <strong>Remove Digits</strong> is not applied inside any HTML, XML, or SGML markup tags remaining in the text.</li><li><strong>Remove Whitespace</strong>: Removes all whitespace characters (blank spaces, tabs, and line breaks), except in HTML, XML, and SGML markup tags. Removing whitespace characters may be useful when you are working with non-Western languages such as Chinese that do not use whitespace for word boundaries. In addition, this option may be desired when tokenizing by character n-grams if you do not want spaces to be part of your n-grams. See the section on <a href="link%20to%20tokenize%20page">Tokenization</a> for further discussion on tokenizing by character n-grams. If <strong>Remove Whitespace</strong> is selected, the following sub-options are available to allow you to fine-tune the handling of whitespace:<ul><li><strong>Remove Spaces</strong>: each <em>blank space</em> will be removed.</li><li><strong>Remove Tabs</strong>: each tab character ( <code>\t </code>) will be removed.</li><li><strong>Remove Line Break</strong>: each newline character ( <code>\n </code>) and carriage return character ( <code>\r </code>) will be removed.</li></ul></li><li><strong>Scrub Tags</strong>: Handles markup tags in angular brackets, such as those used in XML, HTML, and SGML. In markup languages like these, start and end tags like <code>&lt;p&gt;...&lt;/p&gt; </code> are used to designate an &ldquo;element&rdquo;. Elements may be modified by &ldquo;attributes&rdquo; specified inside the start tag. For instance, a text using the <a target="_blank" href="http://www.tei-c.org/index.xml">Text Encoding Initiative (TEI)</a> specification for XML might contain the markup <code>&lt;p rend=&quot;italic&quot;&gt;...&lt;/p&gt; </code> for a paragraph in italics. When this option is selected, a gear icon will appear. Click the icon to open the tag scrubbing dialog. This will allow you to choose one of four options to handle each type of tag or to handle all the tags at once:<ul><li><strong>Remove Tag Only (default)</strong>: Removes the start and end tags but keeps the content in between. For instance, <code>&lt;p&gt;Some text&lt;/p&gt; </code> will be replaced by <code>Some text </code>.</li><li><strong>Remove Element and All Its Contents</strong>: Removes the start and end tags and all the content in between. For instance, <code>&lt;p&gt;Some text&lt;/p&gt; </code> will be removed entirely.</li><li><strong>Replace Element&rsquo;s Contents with Attribute Value</strong>: Replaces the element with the value of one of its attributes. Since elements may have multiple attributes, Lexos allows you to enter the name of the attribute you wish to use. For instance, if you have some markup like <code>&lt;stage type=&quot;setting&quot;&gt; Scene &lt;view&gt;Morning-room in Algernon&#39;s flat in Half-Moon Street.&lt;/view&gt;&lt;/stage&gt; </code>, you could use this option to replace the entire scene description with <code>setting </code> if you entered <code>type </code> as the attribute name.</li><li><strong>Leave Tag Alone</strong>: This option will leave the specified element untouched in the text. This is especially useful if you want to scrub only certain markup tags.</li></ul><p><strong>Troubleshooting</strong>: Lexos compiles a list of the tags in your documents by first attempting to parse the documents as XML. 
If the markup is not well-formed XML, it next tries to parse the documents as HTML using Python&rsquo;s <a target="_blank" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a> library. This will generally work, with the proviso that BeautifulSoup automatically converts all tags to lowercase. As a result, the Lexos scrubbing function will miss HTML (and SGML) tags that contain uppercase letters. You may need to check whether any of the tags Lexos finds appear with uppercase letters in your original document. If you find that Lexos is not scrubbing tags containing capital letters, you will have to change these in an editor before uploading the files. This issue does not affect valid XML files, since XML parsers are case sensitive. If Lexos is unable to compile an accurate list of the tags in your XML file, we recommend testing the file with an <a target="_blank" href="http://www.w3schools.com/xml/xml_validator.asp">XML Validator</a>.</p></li></ol>
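<p>For illustration only, here is a minimal Python sketch (not the Lexos source itself) of the category-based filtering that the <strong>Remove All Punctuation</strong> and <strong>Remove Digits</strong> options rely on, using the standard-library <code>unicodedata</code> module:</p><pre><code>import unicodedata

def strip_by_category(text, prefixes=('P', 'S')):
    # Drop characters whose Unicode category begins with any of the
    # given prefixes: 'P' = punctuation, 'S' = symbol, 'N' = number.
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith(prefixes))

print(strip_by_category('Whan that Aprille, with his shoures soote...'))
# Whan that Aprille with his shoures soote
print(strip_by_category('3.14', prefixes=('P', 'S', 'N')))
# prints an empty line: the digits and the decimal point are all removed
</code></pre>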
<h4>Additional Options</h4><ol><li><strong>Stop Words/Keep Words</strong>: &ldquo;Stop Words&rdquo; represents a list of words or terms to <em>remove</em> from your documents, and &ldquo;Keep Words&rdquo; represents a list of words or terms that should remain in your documents with all other words removed. In both cases, words must be entered as comma-separated or line-separated lists like the following:<pre><code>a, some, that, the, which
a
some
that
the
which
</code>
</pre> You may enter these lists manually in the provided form area or upload a file (e.g. <code>stopWords.txt </code>). Note that the <strong>Make Lowercase</strong> option will be applied to your list of stop/keep words if that option is also selected.</li><li><strong>Lemmas</strong>: Replaces all instances of terms in a list with a common replacement term called a &ldquo;lemma&rdquo;. Lemmas might be conceived of as dictionary headwords. Using the lemmas option will allow you to count a lemma and all of its variants (such as grammatically inflected forms) as a single term. For instance, in Old English, the word for &ldquo;king&rdquo;, <em>cyning</em>, may occur as <em>cyninges</em> (possessive) or <em>cyningas</em> (plural), amongst other variants. If each of these forms occurs one time in a text, the <strong>Lemmas</strong> function will instruct Lexos to treat this as three occurrences of the type <em>cyning</em>. Lemmas are specified by providing a comma-separated list of variants followed by a colon and then the lemma. Multiple lemmas can be specified on separate lines as shown below:<pre><code>cyninges, cyningas: cyning
Beowulfes, Beowulfe: Beowulf
</code>
</pre> The list may be entered manually in the form provided or uploaded from a file. Note that the <strong>Make Lowercase</strong> option will be applied to your list of tokens and lemmas if that option is also selected. To replace individual characters with other characters, you should use the <strong>Consolidations</strong> option.</li><li><strong>Consolidations</strong>: Replaces a list of characters with a different character, typically to consolidate symbols considered equivalent. For instance, in Old English the common character &ldquo;eth&rdquo; (<em>&eth;</em>) is interchangeable with the character &ldquo;thorn&rdquo; (<em>&thorn;</em>). The <strong>Consolidations</strong> option allows you to merge the two into a single character. Consolidations should be entered in the format <code>&eth;: &thorn; </code>, where you wish to change all occurrences of <code>&eth; </code> to <code>&thorn; </code>. Multiple consolidations can be separated by commas or line breaks. Consolidations can be entered manually in the provided form field or uploaded from a file. Note that the <strong>Make Lowercase</strong> option will be applied to your list of characters if that option is also selected. To replace entire words (terms) with other words, you should use the <strong>Lemmas</strong> option.</li><li><strong>Special Characters</strong>: Replaces character entities with their glyph equivalents. A <a target="_blank" href="https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references">character entity</a> is a symbolic representation for an actual character symbol (glyph). Entities are used by markup languages like HTML, XML, and SGML when the symbol itself cannot be entered in the editor used to produce the text or when the method of rendering the character is left to independent software like a web browser. For instance, in HTML, the Old English character &ldquo;aesc&rdquo; (<em>&aelig;</em>) is represented with the entity <code>&amp;aelig; </code>. Since Lexos works entirely with Unicode characters, you will most likely want to replace character entities with their Unicode equivalents prior to further analysis. The <strong>Special Characters</strong> option can be used to replace entities like <code>&amp;aelig; </code> with the corresponding Unicode glyph <em>&aelig;</em>. Lexos provides four rule sets of pre-defined entities and their corresponding glyphs:<ul><li><strong>Early English HTML</strong>: Transforms a variety of HTML entities used to encode Old English, Middle English, and Early Modern English into their corresponding glyphs.</li><li><strong>Dictionary of Old English SGML</strong>: Transforms SGML entities used by the <em>Dictionary of Old English</em> into their corresponding glyphs.</li><li><strong>MUFI 3</strong>: Transforms entities specified in version 3.0 of the Medieval Unicode Font Initiative (MUFI 3) to their corresponding glyphs.</li><li><strong>MUFI 4</strong>: Transforms entities specified in version 4.0 of the Medieval Unicode Font Initiative (MUFI 4) to their corresponding glyphs.</li></ul><p>Note: Selecting MUFI 3 or MUFI 4 will convert entities specified by the Medieval Unicode Font Initiative (MUFI) to their Unicode equivalents. In this case, the Preview window will be changed to use the <a target="_blank" href="http://junicode.sourceforge.net/">Junicode</a> font, which correctly displays most MUFI characters. However, if you download your files after scrubbing, these characters may not display correctly on your computer if you do not have a MUFI-compatible font installed. Information about MUFI and MUFI-compatible fonts can be found on the <a href="http://folk.uib.no/hnooh/mufi/">MUFI website</a>.</p><p>Note: Any special characters that appear inside tags <em>will</em> be modified.</p><p>You may also design your own rule set if you are not using a language covered by one of the pre-defined rule sets. To do this, enter your transformation rules in the provided form field. The entity should be separated from its replacement glyph by a comma (e.g. <code>&amp;aelig;, &aelig; </code>). Multiple transformation rules should be listed on separate lines. A short sketch of this kind of rule-based replacement follows this list. The Lexomics Project welcomes submissions of new rule sets. Please use the <strong>Feedback and Support</strong> button in Lexos or <a target="_blank" href="https://docs.google.com/a/wheatoncollege.edu/forms/d/e/1FAIpQLSddEsRE2PcserYwcjtNpBAMF-YRKVrL4H4LtWDxHeNKoVVxcA/viewform">click here</a> to contact us about adding a pre-defined rule set to Lexos.</p></li></ol>
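<p>For illustration only, here is a minimal Python sketch of the kind of rule-based replacement performed by the <strong>Consolidations</strong> and <strong>Special Characters</strong> options. The rule table below is invented for the example and is not one of the Lexos rule sets:</p><pre><code># A tiny, invented rule table: one entity rule and one consolidation rule.
rules = {
    '&amp;aelig;': '&aelig;',  # replace the entity with the glyph aesc
    '&eth;': '&thorn;',          # consolidate eth with thorn
}

def apply_rules(text, rules):
    for old, new in rules.items():  # rules apply in order; order can matter
        text = text.replace(old, new)
    return text

print(apply_rules('&amp;aelig;fter &eth;&aelig;m', rules))  # prints: &aelig;fter &thorn;&aelig;m
</code></pre>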
<h3>Replacing Patterns</h3><p>Sometimes it is necessary to replace a pattern rather than a precise string. For instance, if a document contains multiple URLs like <code>http://lexos.wheatoncollege.edu</code> and <code>http://scalar.usc.edu/works/lexos/</code>, and you need to strip these URLs, a method is required for matching all URLs without knowing what they are in advance. This is known as regular expression (regex) pattern matching. Lexos uses regular expressions internally to perform its scrubbing options, but, as of version 3.0, it does not provide a way for users to supply their own regular expression patterns. If you need to strip or replace patterns by regular expression, you will have to do so with a separate script or tool. A useful regular expressions tutorial can be found at <a href="https://regexone.com/" target="_blank">RegexOne</a>. Most modern text editors, such as <a href="https://www.sublimetext.com/" target="_blank">Sublime Text</a> and <a href="http://www.barebones.com/products/TextWrangler/" target="_blank">TextWrangler</a>, accept regular expressions in their search and replace functions, and you may find them a convenient means of performing actions with regular expressions. We hope to add regular expression pattern matching to Lexos in the near future.</p>
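<p>For illustration, a minimal Python sketch of the kind of external script you might use until then is shown below, stripping URLs with the standard-library <code>re</code> module. The pattern is deliberately simplified; robust URL matching requires more care:</p><pre><code>import re

text = ('See http://lexos.wheatoncollege.edu and '
        'http://scalar.usc.edu/works/lexos/ for details.')

# A deliberately simple URL pattern: 'http' or 'https', then '://',
# then any run of non-whitespace characters.
url_pattern = re.compile(r'https?://\S+')

print(url_pattern.sub('', text))
# See  and  for details.   (note the leftover double spaces)
</code></pre>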
<h3>The Lexos Scrubber Algorithm</h3><p>Lexos scrubs documents by applying rules in the following order:</p><h4><u>When the <strong>Preview Scrubbing</strong> button is clicked</u></h4><p>Markup tags in angular brackets are not affected by the rules below except rule 4. The actual text is not permanently modified at this point; the Preview window shows a sample of what will be changed if you select <strong>Apply Scrubbing</strong>.</p><ol><li>Remove Project Gutenberg boilerplate, if present.</li><li>Convert stopwords, keepwords, lemmas, consolidations, and special characters to lowercase (the actual text is converted to lowercase later; see step 5 below).</li><li>Apply special character transformations.</li><li>Apply markup tag scrubbing rules.</li><li>Convert text to lowercase.</li><li>Apply consolidation rules.</li><li>Apply lemmatization rules.</li><li>Apply stopword/keepword lists.</li><li>Remove punctuation (hyphens, apostrophes, ampersands).</li><li>Remove digits.</li><li>Remove whitespace.</li></ol><h4><u>When the <strong>Apply Scrubbing</strong> button is clicked</u></h4><p>Markup tags in angular brackets are again unaffected by all rules except rule 4. The same eleven rules listed above are applied in the same order, this time modifying the active documents.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<scalar:continue_to_content_id>158923</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2017-03-05T09:29:49+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:1052905"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/scrubber"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:831957:7">
<scalar:urn rdf:resource="urn:scalar:path:1817392:831957:7"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/cut.7#index=7"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:834853:8">
<scalar:urn rdf:resource="urn:scalar:path:1817392:834853:8"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/tokenize.13#index=8"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/tokenize">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-03T11:17:31+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:159092"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/tokenize.13"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/tokenize.13"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/tokenize.13">
<ov:versionnumber>13</ov:versionnumber>
<dcterms:title>The Tokenize/Count Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos Tokenizer tool</dcterms:description>
<sioc:content><p>The <strong>Tokenizer/Count</strong> tool, also known as <strong>Tokenizer</strong>, is the backbone for many functions in Lexos. Tokenization is the process of dividing a string of text into countable units called &ldquo;tokens&rdquo;. Tokens are typically individual characters or words, but they can also be &ldquo;n-grams&rdquo;, units composed of one or more sequences of characters or words. By default, Lexos divides text into tokens using spaces as token delimiters. However, it can be set to treat every character as a token or to treat n-gram sequences as tokens.</p><p>Once the text is divided into tokens, Lexos assembles a <strong>Document-Term Matrix (DTM)</strong>. This is a table of &ldquo;terms&rdquo; (also called &ldquo;types&rdquo;)&mdash;unique token forms&mdash;that occur in the active documents. Lexos calculates the number of times each document contains each term to produce the DTM. It displays the DTM as a table where you can explore important statistical information about your texts. Note that text corpora containing a large number of documents or types can take a while to process, so please be patient. If the table is too big, it may cause your browser to hang, and you may be forced to download the DTM to a spreadsheet program and work there. Lexos attempts to warn you when it is likely that you will need to download your data. Even if Lexos is able to display your DTM quickly, you may wish to download the data for use in other programs.</p><h3>Using the DTM Table</h3><p>By default, Lexos displays the DTM with documents listed in columns and terms listed in rows. You may choose to transpose the table by selecting the <strong>Documents as Rows, Terms as Columns</strong> option. However, it is most likely that you will have relatively few documents and a relatively large number of terms. Transposing the matrix will produce a table with potentially hundreds or thousands of columns, requiring you to scroll horizontally to view them. Lexos will warn you when this is likely and give you the option to download the transposed table to a spreadsheet program, where you may find it easier to work. You may also click the eye icon to toggle the visibility of individual columns. If you change the setting between <strong>Documents as Columns, Terms as Rows</strong> and <strong>Documents as Rows, Terms as Columns</strong>, click the <strong>Regenerate Table</strong> button to apply the change of setting.</p><p>By default Lexos displays 10 table rows per page, but you can change this using the <strong>Display</strong> dropdown menu. You can also filter the rows by entering keywords in the <strong>Search</strong> form. To sort the table, click on a column header. A small arrow icon in the header label will indicate both which column is being used for sorting and whether the sort direction is ascending or descending. Lexos calculates totals and averages for both rows and columns.</p><p>To download the DTM, click the <strong>Download CSV</strong> or the <strong>Download TSV</strong> button. &ldquo;CSV&rdquo; is short for comma-separated values, whereas &ldquo;TSV&rdquo; is short for tab-separated values. In your downloaded file, a comma or a tab will serve as the column delimiter. Spreadsheet programs can usually open both formats, but you may find one or the other easier to use for your purposes.</p>
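<p>For illustration only, here is a minimal Python sketch (independent of the Lexos internals) of how a document-term matrix like the one displayed by <strong>Tokenizer</strong> can be assembled, using whitespace tokenization and the standard-library <code>collections.Counter</code>:</p><pre><code>from collections import Counter

# Two invented miniature documents.
docs = {
    'doc1': 'the cat sat on the mat',
    'doc2': 'the dog sat',
}

# Tokenize on whitespace and count how often each term occurs per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Print the matrix with terms as rows and documents as columns.
terms = sorted(set().union(*counts.values()))
for term in terms:
    print(term, [counts[name][term] for name in docs])
# cat [1, 0] / dog [0, 1] / mat [1, 0] / on [1, 0] / sat [1, 1] / the [2, 1]
</code></pre>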
<h3>Using the Advanced Options</h3><p>The configuration options in the top right inset section of the <strong>Tokenizer</strong> tool allow you to change how the DTM is built. An important feature of these options is that they are saved to your session and will apply to all the other Lexos tools that make use of the DTM. For instance, if you restrict your DTM to only the 10 most frequent terms in your corpus, this slice of your DTM will also be used to generate word clouds, cluster analyses, and so on. The same configuration options occur in the other Lexos tools, so it is possible to change the settings there. In <strong>Tokenizer</strong>, you should click the <strong>Regenerate Table</strong> button each time you change the settings to re-build the DTM with the new configuration.</p><p><strong>Tokenizer</strong> provides several methods of manipulating the DTM in the panel at the top right of the screen. Instructions for using these methods can be found in <a href="advanced-options">Advanced Options</a>.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<scalar:continue_to_content_id>159092</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-16T20:40:30+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:834853"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/tokenize"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:839446:9">
<scalar:urn rdf:resource="urn:scalar:path:1817392:839446:9"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/rolling-windows.2#index=9"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/rolling-windows">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-15T17:46:50+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:314550"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/rolling-windows.2"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/rolling-windows.2"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/rolling-windows.2">
<ov:versionnumber>2</ov:versionnumber>
<dcterms:title>The Rolling Windows Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos Rolling Windows tool</dcterms:description>
<sioc:content><p><strong>Rolling window</strong> analysis is a method of tracing the frequency of terms within a designated window of tokens over the course of a document. It can be used to identify small- and large-scale patterns of usage of individual features or to compare these patterns for multiple features. Rolling window analysis tabulates term frequency as part of a continuously moving metric, rather than in discrete segments. Beginning with the selection of a window, say 100 tokens, rolling window analysis traces the frequency of a term&#39;s occurrence first within tokens 1-100, then tokens 2-101, then tokens 3-102, and so on until the end of the document is reached. The result can be plotted as a line graph so that it is possible to observe gradual changes in a token&rsquo;s frequency as the text progresses. Plotting different tokens on the same graph allows us to compare their frequencies.</p><p>The Lexos <strong>Rolling Windows Tool</strong> performs this analysis. It has numerous options, which are best understood as part of a workflow. In the Lexos interface, the steps of this workflow are numbered 1-6. Each of these steps is discussed below; a short sketch of the rolling-average calculation follows the list.</p><ol><li><strong>Select Active Document:</strong> Lexos performs rolling windows analysis on a single active document at a time. Use the radio buttons to select which document you would like to examine.</li><li><strong>Select Calculation Type:</strong> Lexos will plot either the average term frequency in each window (<strong>Rolling Average</strong>) or the ratio of term frequencies if you are examining multiple terms (<strong>Rolling Ratio</strong>).</li><li><strong>Enter Search Terms:</strong> These are the terms you wish to plot from your document. Enter up to 6 terms, separated by commas. When Lexos searches your document for these terms, it uses the document text, rather than the Document-Term Matrix (DTM), as its starting point. This means that you can choose to search for strings of text, individual words or terms (separated by spaces), or regular expressions (regex). A basic tutorial for using regex can be found at <a target="_blank" href="https://regexone.com/">https://regexone.com/</a>.</li><li><strong>Define Window:</strong> This is where you set the size of the window you want to use. It can consist of any number of characters, tokens (separated by spaces), or lines (separated by line breaks in the text). If your document contains milestones, click the checkbox, and the location of each milestone will be indicated on the rolling window graph by a vertical line.</li><li><strong>Choose Display Options:</strong> The <strong>Hide Individual Points</strong> option (turned on by default) produces an uninterrupted line graph, which may be easier to read. Turning this option off will show the points where each term occurs in the document. Mousing over a point will display the location of the term in the token sequence (starting from 0), along with the average or ratio at that point in the window. The <strong>Black and White Only</strong> option produces a non-color version of the graph that is suitable for downloading and publishing in journals.</li><li><strong>Get Graph:</strong> Click the <strong>Get Graph</strong> button to generate the Rolling Windows graph. Once it has been generated, the screen will scroll automatically to the top of the graph. Download buttons will also appear both above and below the graph. You can download the data by clicking the <strong>CSV Matrix</strong> button. This will give you a comma-separated values (CSV) file, which you can open in a spreadsheet program. To download the image, click either of the SVG buttons as appropriate for your browser. A new tab will open, and you can save the image by right-clicking and saving the page.</li></ol>
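<p>For illustration only, the following minimal Python sketch (independent of the Lexos implementation) computes a rolling average for a single term over a fixed token window:</p><pre><code>def rolling_average(tokens, term, window=100):
    # Frequency of `term` in each window of `window` consecutive tokens.
    averages = []
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        averages.append(chunk.count(term) / window)
    return averages

# An invented 200-token example in which 'gardena' is every fifth token.
tokens = ('hwaet we gardena in geardagum ' * 40).split()
print(rolling_average(tokens, 'gardena', window=100)[:3])
# [0.2, 0.2, 0.2]
</code></pre>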
<h3>Additional Graph Interactivity</h3><p>In addition to mousing over points when <strong>Hide Individual Points</strong> is turned off, you can drag your mouse over portions of the bottom ribbon to magnify sections of the graph.</p></sioc:content>
<scalar:defaultView>plain</scalar:defaultView>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-22T20:20:57+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:version:839446"/>
<dcterms:isVersionOf rdf:resource="http://scalar.usc.edu/works/lexos/rolling-windows"/>
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Version"/>
</rdf:Description>
<rdf:Description rdf:about="urn:scalar:path:1817392:832371:10">
<scalar:urn rdf:resource="urn:scalar:path:1817392:832371:10"/>
<oac:hasBody rdf:resource="http://scalar.usc.edu/works/lexos/manual.10"/>
<oac:hasTarget rdf:resource="http://scalar.usc.edu/works/lexos/word-cloud.6#index=10"/>
<rdf:type rdf:resource="http://www.openannotation.org/ns/Annotation"/>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/word-cloud">
<rdf:type rdf:resource="http://scalar.usc.edu/2012/01/scalar-ns#Composite"/>
<scalar:isLive>1</scalar:isLive>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/6902"/>
<dcterms:created>2015-06-04T10:07:32+00:00</dcterms:created>
<scalar:urn rdf:resource="urn:scalar:content:159678"/>
<scalar:version rdf:resource="http://scalar.usc.edu/works/lexos/word-cloud.6"/>
<dcterms:hasVersion rdf:resource="http://scalar.usc.edu/works/lexos/word-cloud.6"/>
<scalar:citation>method=instancesof/content;methodNumNodes=63;</scalar:citation>
</rdf:Description>
<rdf:Description rdf:about="http://scalar.usc.edu/works/lexos/word-cloud.6">
<ov:versionnumber>6</ov:versionnumber>
<dcterms:title>The Word Cloud Tool</dcterms:title>
<dcterms:description>Manual page for the Lexos Word Cloud tool</dcterms:description>
<sioc:content><p>Word clouds are a method of visualizing the <strong>Document-Term Matrix</strong>. They present terms arranged at angles for compactness, with each term sized according to its frequency within the text. Word clouds enable you to get a sense of the content in your corpus, and they are very good for presentations. However, they also have some well-known limitations (see the topics article on <a href="">visualizing texts with word clouds</a>). In some languages, individual tokens may not correspond to words, which will limit the usefulness of this method of visualization.</p><p>The Lexos <strong>Word Cloud</strong> tool uses Jason Davies&rsquo; excellent <a target="_blank" href="https://www.jasondavies.com/wordcloud/">word cloud generator for d3.js</a>&mdash;with a few modifications&mdash;to create beautiful, interactive word clouds. This implementation scales the size of terms to ensure they all fit within the layout. The color used for each term does not convey meaning and is purely aesthetic.</p><h3>Generating Word Clouds</h3><p>Lexos allows you to choose some or all of your active documents from which to generate a word cloud. Once you have selected your documents using the checkboxes at the top right, click the <strong>Get Graph</strong> button. After a few seconds, a word cloud will fade into view (be patient if you have selected large or many documents). Running your mouse cursor over each term in the word cloud will generate a tooltip showing the number of times it occurs in the documents you have selected. Click the <strong>View Counts Table</strong> button next to <strong>Get Graph</strong> or below the word cloud to open a dialog containing a searchable, sortable table of the term counts in your word cloud.</p><p><strong>Warning</strong>: The d3.js algorithm used by Lexos has an important limitation. It attempts to lay out terms in as compact a manner as possible and is sometimes unable to find a fit for high-frequency words. In these cases, those words are dropped from the word cloud. Because of this limitation, we highly recommend that you view the Counts Table to make sure that all the most frequent words are represented in the word cloud. If you find that this is not the case, try generating the word cloud again; sometimes it will find a better layout in which the high-frequency words fit. Using the layout options described below may also allow you to produce word clouds in which the missing words fit within the layout.</p><h3>Layout Options</h3><p>Davies&rsquo; word cloud generator offers some useful ways to modify the layout using the controls below the graph. After modifying the settings, you can re-generate the word cloud by clicking anywhere on the graph. Each of the settings is described in detail below:</p><h4><u>Spiral</u></h4><p>This refers to the method of calculating the angles and placement of terms in the layout. The <strong>Archimedean</strong> setting uses the <a target="_blank" href="https://en.wikipedia.org/wiki/Archimedean_spiral">Archimedean spiral</a> to determine the layout. The <strong>Rectangular</strong> setting attempts to place terms within a rectangular shape.</p><h4><u>Scale</u></h4><p>This refers to how individual terms are sized relative to one another in the word cloud.
Settings are <code>log n</code> (logarithmic scale), <code>&radic;n</code> (square root scale), and <code>n</code> (linear scale), where <code>n</code> refers to the number of times an individual term occurs. <code>log n</code> and <code>&radic;n</code> are methods of transforming this number based on the possible minimum and maximum values. No single scaling is inherently superior to the others, but they will produce different effects in the layout. The <code>n</code> setting preserves the original proportionality of the values as far as possible. <code>log n</code> may aid the differentiation of data that is not uniformly distributed. The square root transformation will inflate smaller numbers but stabilize the size of larger ones. (A short sketch at the end of this page illustrates these three scalings.)</p><h4><u>Font</u></h4><p>You can change the appearance of your word cloud by setting the font here. This feature should work with any font installed on your system.</p><h4><u>Orientation Settings</u></h4><p>In the middle of the <strong>Layout Options</strong> controls is a form to set the number of different orientations terms can have in the layout. You can also set the range of angles, either by entering the number of degrees in the form fields or by dragging the angles in the image below them.</p><h4><u>Number of Words</u></h4><p>By default, Lexos includes the top 250 terms in your documents. Use this setting to modify that number. Limiting the number of terms may help you to include high-frequency terms that would otherwise be dropped by the layout algorithm.</p><h4><u>Download</u></h4><p>Word clouds are downloadable in either SVG or PNG format. SVG images are very useful because they scale well in web browsers. If you click the SVG button, a new window will open with a copy of your word cloud. Use your browser&rsquo;s <strong>Save as&hellip;</strong> function to save the web page. If you click the PNG button, the image will open in a new window. The procedure for saving a PNG image is not standard in all browsers, so follow the instructions you see on the screen.</p>
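<p>For illustration only, here is a minimal Python sketch (not the d3.js implementation) of how the three scalings might map term counts to font sizes; the 60-pixel maximum size is an invented parameter:</p><pre><code>import math

def font_size(count, max_count, scale='log', largest=60):
    # Map a term count to a font size under three illustrative scalings.
    if scale == 'log':
        value = math.log(count) / math.log(max_count)
    elif scale == 'sqrt':
        value = math.sqrt(count) / math.sqrt(max_count)
    else:  # linear ('n')
        value = count / max_count
    return round(largest * value)

for n in (1, 10, 100):
    print(n, [font_size(n, 100, s) for s in ('log', 'sqrt', 'n')])
# 1 [0, 6, 1]
# 10 [30, 19, 6]
# 100 [60, 60, 60]
</code></pre></sioc:content>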
<scalar:defaultView>plain</scalar:defaultView>
<scalar:continue_to_content_id>159678</scalar:continue_to_content_id>
<prov:wasAttributedTo rdf:resource="http://scalar.usc.edu/works/lexos/users/3693"/>
<dcterms:created>2016-08-15T10:17:23+00:00</dcterms:created>