5. The preliminary ideas that would result
in the development of eDictor in 2007
started in 2004 with a project that aimed
at restructuring the text-preparation
system at the Tycho Brahe Corpus.
>
2004-2006
7. Essentially, the idea was that the Corpus
would be constituted of
single-source documents
that could contain all relevant annotations
(textual, philological, linguistic).
>
2004-2006
8. This was achieved in partnership with
computer scientist Thorsten Trippel, from the
University of Bielefeld.
He suggested that we use XML as the annotation
language to re-encode the Corpus, and XSLT to
transform each document into different
presentations of the encoded information.
>
2004-2006
9. Our central idea was to encapsulate editorial
interventions at the word level, i.e. for each
token in the corpus, keeping the original form
together with its edited form, so that each element of
the pair would be available to different modules
of analysis.
>
2004-2006
10. This first idea was applied to a few pilot texts, and
published as a poster at the annual conference of the
ALLC in 2004
PAIXÃO DE SOUSA, M. C.; TRIPPEL, T. Single source process
Historic corpora for diverse uses.
In: Proceedings of the Association for Literary and Linguistic
Computing (ALLC) Annual Conference, 2004.
>
2004-2006
11. In 2005, the Corpus went through a complete
re-encoding process.
2004-2006
>
12. The restructured Corpus was composed
of XML documents that, via XSLT
transformations, would render different
(HTML and TXT) versions, adequate
for different visualization and processing needs,
as we had originally planned.
>
2004-2006
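The single-source principle described above can be sketched in a few lines. This is an illustration only, under stated assumptions: it uses Python's standard `xml.etree.ElementTree` in place of XSLT, and the element names (`text`, `s`, `w`, `o`, `e`), the function names and the sample words are hypothetical, not the Corpus's actual schema or stylesheets. It derives a plain-text view and an HTML view from one and the same document.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source document: each word keeps its original
# form (<o>) and, optionally, an edited form (<e>).
SOURCE = """
<text>
  <s id="s_1">
    <w><o>fama</o></w>
    <w><o>immortal</o><e>imortal</e></w>
  </s>
</text>
""".strip()


def word_form(w, edited):
    """Return the edited form when requested and present, else the original."""
    e = w.find("e")
    if edited and e is not None:
        return (e.text or "").strip()
    return (w.findtext("o") or "").strip()


def to_txt(root, edited=True):
    """Plain-text view: one sentence per line, prefixed by its id."""
    lines = []
    for s in root.iter("s"):
        words = [word_form(w, edited) for w in s.iter("w")]
        lines.append(f"[{s.get('id')}] " + " ".join(words))
    return "\n".join(lines)


def to_html(root):
    """HTML view: edited form displayed, original kept in a title attribute."""
    paragraphs = []
    for s in root.iter("s"):
        spans = []
        for w in s.iter("w"):
            original = word_form(w, edited=False)
            shown = word_form(w, edited=True)
            spans.append(f'<span title="{escape(original)}">{escape(shown)}</span>')
        paragraphs.append("<p>" + " ".join(spans) + "</p>")
    return "\n".join(paragraphs)


root = ET.fromstring(SOURCE)
txt_view = to_txt(root)
html_view = to_html(root)
print(txt_view)
print(html_view)
```

Because both views are computed from the same annotated source, a correction made once in the XML is reflected in every presentation; an XSLT stylesheet would play the role of `to_txt` and `to_html` here.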
17. The Tycho Brahe Corpus, restructured (simple text for further processing)
[ prologue (author: P.M. Gandavo)]
[ title: AO MUITO ILUSTRE SENHOR DOM LIONIS PEREIRA, Epístola de Pero de Magalhães. ]
[g_008_s_43] Neste pequeno serviço (muito ilustre senhor ) que ofereço a Vossa Mercê das primícias de meu
fraco entendimento, poderá em alguma maneira conhecer os desejos que tenho de pagar com minha
possibilidade alguma parte do muito que se deve à ínclita fama de vosso heróico nome.
[g_008_s_44] E isto assim pelo merecimento do nobilíssimo sangue e clara progênie de onde traz sua origem,
como pelos troféus das grandes vitórias , e casos bem afortunados que lhe hão sucedido nessas partes do
Oriente em que Deus o quis favorecer com tão larga mão, que não cuido ser toda minha vida bastante para
satisfazer à menor parte de seus louvores .
[g_008_s_45] E como todas estas razões me ponham em tanta obrigação , e eu entenda que outra nenhuma
coisa deve ser mais aceita a pessoas de altos ânimos que a lição das escrituras , por cujos meios se alcançam
os segredos de todas as ciências , e os homens vêm a ilustrar seus nomes e perpetuar os na terra com fama
imortal , determinei escolher a Vossa Mercê entre os mais senhores da terra , e dedicar lhe esta breve história .
[g_008_s_46] A qual espero que folgue de ver com atenção e receber me a benignamente debaixo de seu
amparo : assim por ser coisa nova , e eu a escrever como testemunha de vista : como por saber quão particular
afeição Vossa Mercê tem às coisas do engenho , e que por esta causa lhe não será menos aceito o exercício das
escrituras , que o das armas.
[g_008_s_47] Por onde com muita razão favorecido desta confiança possa seguramente sair a luz com esta
pequena empresa e divulgar a pela terra sem nenhum receio , tendo por defensor dela a Vossa Mercê Cuja muito
ilustre pessoa nosso Senhor guarde e acrescente sua vida e estado por longos e felizes anos .
[ end prologue ]
18. Along with the application of the new single-
source system to the Corpus, new ideas started
to pop up.
Some of them were carried on, some were not.
2004-2006
>
19. The main thing that we wanted to do back then
and still have not done is...
... to integrate syntactic annotation
into this same, single-source system...
2004-2006
>
20. Other ideas were a little more fruitful: the
integration of other, less complex levels of
linguistic annotation (such as items of
lexicological interest); and the expansion of the
system to include the possibility of critical
editions, in which more than one version of the
same text could be compared.
2004-2006
>
21. PAIXÃO DE SOUSA, M. C. A Anotação da variação de grafia no Corpus
Histórico do Português Tycho Brahe: Frentes abertas para estudos do léxico. V
Encontro de Corpora: Lingüística de Corpus: a aplicabilidade nos estudos sobre
Léxico, São Carlos, 2005.
22. PAIXÃO DE SOUSA, M. C. Memórias do Texto. Mesa-redonda “Bibliotecas e bancos de
dados digitais de literatura”, II Simpósio Nacional de Literatura e Informática,
Florianópolis, 2005.
Published in 2006 as:
PAIXÃO DE SOUSA, M. C. Memórias do Texto. Texto Digital (UERJ), v. 1, p. 10, 2006.
23. PAIXÃO DE SOUSA, M. C. Critical Hipereditions and the new challenges for text-
critique. Seminário Internacional Literaturas: Del texto al hipertexto. Madri, Universidade
Complutense, setembro de 2006.
Published in 2007 as:
PAIXÃO DE SOUSA, M. C. Digital Text: Conceptual and methodological frontiers. In: Dolores
Romero; Amelia Sanz. (Org.). Literatures in the Digital Era: Theory and Praxis. Cambridge:
Cambridge Scholars Publishing, 2007.
24. By 2006 the single-source encoding system was
mature; a first manual was prepared and a more
complete paper on these results was published.
>
2004-2006
26. TRIPPEL, T.; PAIXÃO DE SOUSA, M. C. Metadata and XML standards
at work: a corpus repository of Historical Portuguese texts. V
International Conference on Language Resources and Evaluation (LREC),
2006.
28. Meanwhile...
... as the system was presented to a wider range
of potential users outside Tycho Brahe,
new challenges emerged.
>
2004-2006
29. I Oficina de Anotação – Projeto CorPorA.
Salvador, 19-21 de abril, 2006.
30. The 1st annotation workshop outside the Tycho
Brahe team, in 2006 in Salvador, was an
important breakthrough.
It was then that we noticed that the original
techniques used to annotate the XML
documents (“by hand”, in Emacs) and to
transform them (by coding XSL and applying it
via Saxon) were not adequate for teams with a less
computational, and more philological,
background.
>
2004-2006
32. After the workshop in 2006 it became clear that
if we wanted more teams to use the single-
source annotation system, we would have to
build software that could perform the
annotation and transformation tasks through a
user-friendly interface.
In other words... it was then that the idea of
eDictor took shape.
>
2004-2006
35. eDictor beta 1.0 was developed in 2007 by
Prof. Fabio N. Kepler (then a post-
graduate student at IME-USP’s computer science
program), and was first presented in the same
year at the VI Encontro de Linguística
de Corpus, at USP.
2007
>
36. PAIXÃO DE SOUSA, M. C.; KEPLER, F. N. E-dictor: uma
ferramenta integrada para a anotação de edição e classe de
palavras. VI Encontro de Lingüística de Corpus, São Paulo, 2007.
37. 2007
This first version of eDictor
contained the core functions
of the original text encoding system:
an XML annotation module
and the option of exporting
via XSLT transformations.
>
38. 2007
Plus... it included a
morphosyntactic tagging function!
>
42. Two important aspects mark the years
2008 to 2012 for the development of eDictor.
The first was the arrival of a new team member,
Pablo P. F. Faria, who joined F. Kepler in
developing the software after the first version.
>
2008-2012
43. The second important aspect was that, while
up to 2008 the main application of the single-
source system (first manually and later with
eDictor) was the restructuring of the Tycho
Brahe Corpus, after 2008 the system started to
be used beyond Tycho Brahe.
>
2008-2012
44. >
2008-2012
This was important because, as the different
projects had different aims, the tool started to
incorporate new technical features.
45. > For instance, in 2009 eDictor started to be used
by the Brasiliana USP team.
One of the main particularities of this context
was that eDictor was used to correct the output of
optical character recognition (OCR),
so new edition categories had to be created.
2008-2012
46. PAIXÃO DE SOUSA, M. C. Desafios do processamento de textos antigos: primeiros
experimentos na Brasiliana Digital . I Workshop de Linguística Computacional da USP,
2009.
47. PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. P. F. O Processamento
automático de textos antigos: Desafios e Experiências. Workshop de Linguística de Corpus
do Projeto Para a História do Português Brasileiro (PHPB), São Paulo, 2010.
49. (ABBYY FineReader 10.0 training module)
50.
<w id="s_6#86">
  <o>amiſjade</o>
  <e t="ocr">amiſſade</e>
  <e t="gra">amissade</e>
  <e t="mod">amizade</e>
  <m v="N"/>
</w>
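One way to read the token above programmatically is sketched below. This is a hedged illustration, not eDictor's own code: the function name `pick_form` and the fallback order are assumptions, and the reading of the attribute values (`ocr` as an OCR correction, `gra` as a spelling normalisation, `mod` as a modernisation, `m` as the morphosyntactic tag) is inferred from the example itself.

```python
import xml.etree.ElementTree as ET

# The token from the slide, with stray whitespace removed.
TOKEN = """<w id="s_6#86">
  <o>amiſjade</o>
  <e t="ocr">amiſſade</e>
  <e t="gra">amissade</e>
  <e t="mod">amizade</e>
  <m v="N"/>
</w>"""


def pick_form(w, preferred=("mod", "gra", "ocr")):
    """Return the first available edition layer, else the original form.

    `preferred` is a hypothetical priority list of edition categories;
    an empty tuple means "always use the original reading".
    """
    layers = {e.get("t"): (e.text or "").strip() for e in w.findall("e")}
    for layer in preferred:
        if layer in layers:
            return layers[layer]
    return (w.findtext("o") or "").strip()


w = ET.fromstring(TOKEN)
modern = pick_form(w)                          # most edited layer available
corrected = pick_form(w, preferred=("ocr",))   # OCR-corrected reading only
original = pick_form(w, preferred=())          # untouched reading
tag = w.find("m").get("v")                     # morphosyntactic tag
print(modern, corrected, original, tag)
```

Keeping every layer inside the same `<w>` element is what lets different modules (OCR review, philological edition, tagging) each select the form they need without altering the others.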
51. > One important consequence for eDictor was
the possibility of adding new edition categories
to the tool’s Preferences file.
52. > Some of these developments were presented
at the VIII Encontro de Linguística
de Corpus in 2009 by Pablo Faria; this
presentation would be published as a book
chapter in 2010.
53. PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. E-dictor: Novas
perspectivas na codificação e edição de corpora de textos históricos. In:
VIII Encontro de Linguística de Corpus, 2009, Rio de Janeiro. 2009.
55. Example of changes after 1.0 beta 001:
Edition Tab – “edition” became an open category
56. > More importantly, researchers that used
manuscript documents became interested in
eDictor.
The special needs of this kind of material led
to very important developments in the tool.
2008-2012
57. > The first group of manuscript documents to
be edited with the tool was the corpus of
19th-century letters from the PhD thesis of
Zenaide Carneiro (2005), now part of the
CEDOHS corpus.
The XML edition of this corpus had been
conceived at the time of the 2006 workshop in
Salvador and, from the start, it brought to
the development of eDictor the challenge of
dealing with the particular categories and edition
needs of manuscripts.
2008-2012
58. > One important example of developments
brought by the needs of manuscript editors
is the fac-simile view functionality.
It was developed by Pablo Faria after
eDictor started to be used by the team at
CEDOHS and by the team led by Célia
Lopes at LaborHistórico, at UFRJ.
2008-2012
61. This new exporting format - Hypertext with fac-
simile view – was integrated in later versions of
eDictor, and is currently used by other projects.
62. LaborHistorico – Laboratório para a História do Português Brasileiro,
Universidade Federal do Rio de Janeiro. Coord. Célia Lopes
Workshop: “Edição Digital e Divulgação de Textos Antigos”,
Rio de Janeiro, 3-5 de fevereiro, 2010.
63. The corpus at LaborHistorico,
with integrated fac-simile view of manuscripts.>
65. > The workshops with the new teams of
users, organized between 2010 and 2012,
resulted in the development of new builds
for eDictor beta 1.0 – and also, thanks to
the expansion in the number of users,
in 2010 we finally got to make a
manual...
2008-2012
67. First Version of eDictor’s Manual (2010)
(... actually, the only version so far)
68. > As a result of this
expansion, between
2009 and 2012
ten builds of eDictor
beta 1.0 were made,
reflecting the additions
that were pointed out as
necessary by the
different user teams.
2008-2012
69. Two important publications were prepared
during this period: a poster at the
ACL meeting of 2010, presented by P. Faria,
and the chapter for the book “Caminhos da
Linguística de Corpus”.
In these papers we tried to cover the
background to eDictor’s creation, the new
developments, and the challenges ahead.
2008-2012
>
70. FARIA, P. P. F.; PAIXÃO DE SOUSA, M. C.; KEPLER, F. N. An Integrated Tool for
Annotating Historical Corpora. The Fourth Linguistic Annotation Workshop (LAW IV) at
The 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010),
Uppsala, 2010.
71. PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. E-dictor: Novas
perspectivas na codificação e edição de corpora de textos históricos. In: Tania
Shepherd; Tony Berber Sardinha; Marcia Veirano Pinto. (Org.). Caminhos da
linguística de corpus. Campinas: Mercado de Letras, 2010.
74. > eDictor 1.0 beta build 010 is the version
currently in use. The main differences
in comparison to beta 001 are the
additions related to fac-simile
integration (in the transcription module
and in the export functionalities) and some
bug fixes in the editions module.
But there are still bugs to be busted!
2013
79. Version 1.0 beta b010 of eDictor is currently being used
by seven projects in Brazil and in Portugal
>
80. Corpus Anotado do Português Tycho Brahe
(Universidade Estadual de Campinas)
Grupo de Pesquisas Humanidades Digitais
(Universidade de São Paulo)
Laboratório de História do Português Brasileiro
(Universidade Federal do Rio de Janeiro)
P.S. – Projeto Arquivo Digital de Escrita Quotidiana em Portugal e Espanha na Época Moderna
(Universidade de Lisboa)
Corpus Eletrônico de Documentos Históricos do Sertão, CEDOHS
(Universidade Federal de Feira de Santana)
Memória Conquistense
(Universidade Estadual do Sudoeste da Bahia)
81. There is still a lot to be done
if we want to make eDictor
a stable and fully transferable
tool.
but of course ...>
82. The spirit of this tool has been one of
growing into the users’ needs and
requests. It will become a better
tool if we work together on what
we want it to be.
>
84. So we are very excited
about this workshop!
Here’s one idea of
how we could work:
>
85. We are launching today (09/09/2013) a new webpage for eDictor, at
http://manualedictor.wordpress.com/.
We could use these days at the workshop
to build more documentation and group it on the page.