The research infrastructure perspective, Dieter Van Uytvanck, CLARIN
1. The Perfect Swell:
Workshop on Text and Data Mining
for Data Driven Innovation
The research infrastructure perspective
Dieter Van Uytvanck
Max Planck Institute for Psycholinguistics
Dieter.VanUytvanck@mpi.nl
TDM workshop, London
2013-09-27
2. CLARIN?
§ Common Language Resources and Technology
Infrastructure
§ aims at providing easy and sustainable access for scholars
in the humanities and social sciences
§ to digital language data (in written, spoken, video or
multimodal form)
§ to advanced tools to discover, explore, exploit, annotate,
analyse or combine them
§ independent of where they are located: a shared
distributed infrastructure
§ More information: www.clarin.eu
TDM workshop
London
2013-09-27
www.clarin.eu
3. Language resources: rich variety
§ Modality: written, spoken, signed
§ Additional channels: eye movements, gestures, neuroimaging
data (EEG, fMRI, …), etc.
[Figure labels: Annotations; Data: the basis for research]
4. Language resources: rich variety
§ Location:
§ data from all over the world (including
some very remote corners)
§ … and from the world wide web,
smartphones, …
§ Time:
§ old historic collections (hieroglyphs,
manuscripts, rock carvings, …), often
OCR’ed, digitised and annotated
§ up to real-time data gathered from
social networks
§ Origin:
§ elicited (experiments)
§ natural language use (“in the wild”)
[Figure labels: Annotations; Data: the basis for research]
5. Data mining in CLARIN
§ very important paradigm in language resource processing
§ major shift from rule-based to data-driven systems
§ not only text, but also multimedia
§ importance of
§ access to primary data for fellow researchers: TDM requires
access to whole works, not only to snippets and sentences
§ replicability: being able to replicate experiments is essential
§ technical support: virtual collections make it possible to refer to
large online data sets
§ a safe legal setting for researchers (license signing does not scale
to 500,000 texts that are automatically collected from thousands
of websites)
6. Data mining in CLARIN
§ some examples to demonstrate the variation and nature of
data mining based on language resources
7. Some examples (1)
§ Mass text analysis (Petersen et al., 2012):
doi:10.1038/srep00313
8. Some examples (2)
§ AUVIS face/hand tracking analysis:
http://tla.mpi.nl/projects_info/auvis/
Head/Hands Tracking
9. Some examples (3)
§ Stylometry and plagiarism detection
http://www.clips.ua.ac.be/category/projects/stylometry
§ e.g. Mike Kestemont, http://www.mike-kestemont.org/?p=362
10. Some examples (4)
§ Language evolution analysis with phylogenetic trees (Bouckaert
et al., 2012) – doi:10.1126/science.1219669
Excerpt from Bouckaert et al. (2012):
"At the other extreme, we fit a “sailor” model with no reluctance to move
into water and rapid movement across water. Consistent with the findings
based on the RRW model, each of the landscape-based models supports the
Anatolian farming theory of Indo-European origin (Table 1). Our results
strongly support an Anatolian homeland for the Indo-European language
family. The inferred location (Fig. 1) and timing [95% highest posterior
density (HPD) interval, 7116 to 10,410 years ago] of Indo-European origin
is congruent with the proposal that the family began to diverge with the
spread of agriculture from [...]"
Fig. 2. Map and maximum clade credibility tree showing the diversification
of the major Indo-European subfamilies. The tree shows the timing of the
emergence of the major branches and their subsequent diversification. The
inferred location at the root of each subfamily is shown on the map, colored
to match the corresponding branches on the tree. Albanian, Armenian, and
Greek subfamilies are shown separately for clarity (inset). Contours represent
the 95% (largest), 75%, and 50% HPD regions, based on kernel density
estimates (15).
Phylogeographic analysis | Bayes factor: Anatolian vs. steppe I | Bayes factor: Anatolian vs. steppe II
RRW: All languages | 175.0 | 159.3
RRW: Ancient languages only | 1404.2 | 1582.6
RRW: Contemporary languages only | 12.0 | 11.4
Landscape aware: Diffusion | 298.2 | 141.9
Landscape aware: Migration from land into water less likely than from land to land by a factor of 10 | 197.7 | 92.3
Landscape aware: Migration from land into water less likely than from land to land by a factor of 100 | 337.3 | 161.0
Landscape aware: Sailor | 236.0 | 111.7
11. The research infrastructure role
§ Data sets:
§ Long-term preservation (archiving)
§ Making them citable (persistent identifiers) and findable
(metadata)
§ Making access easier with federated login
§ Lowering the threshold to use advanced software
§ offer web front-ends, web service chains
§ cooperation with computing centres for heavy tasks
§ Know-how building & support
§ about the nature of the resources and tools
§ technical matters
§ legal issues
12. Legal perspective on resources
§ Rough classification of language resources
available via the CLARIN centres:
§ Public
§ full access, no restrictions at all
§ e.g. parallel corpora from the EU Parliament
§ Academic
§ available for all academic users
§ e.g. the Spoken Dutch Corpus (radio recordings, …)
§ Restricted
§ everything more restricted than Academic: personalised
access rules apply
§ e.g. video from doctor-patient interaction
13. Legal perspective on resources
§ CLARIN recommends CC licenses for new resources, as
this is the least problematic option for everyone in the long run.
Such resources can be made publicly available.
§ For older material, we try to distribute it as freely as
can be negotiated. For this material we offer two categories:
§ resources free for researchers
§ resources requiring individual permission from the owner.
§ Note that not everything is about copyright.
§ We also have to deal with personal data, which can only be
provided for a limited time to individual researchers unless
it is anonymized.
§ Ethical perspectives should also be taken into account (e.g.
asking participants at the time of recording whether they agree
to data mining/processing).
14. Technical Perspective (1)
§ The above restrictions can be realized by requiring:
§ PUB - no identification of the user and no individual
permission, i.e. the resources are free for all and publicly
available.
§ ACA - identification of the user, but no individual
permission, e.g. CLARIN-distributed resources for academic
use.
§ RES - identification of the user and individual usage
permission, i.e. the resources are restrictedly available to
individual researchers, e.g. resources containing personal
data.
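The three access classes above can be read as a small decision rule. The following is a minimal sketch of that rule, not CLARIN's actual implementation; the names `AccessClass` and `may_access` and the two boolean inputs are assumptions introduced for illustration.

```python
from enum import Enum


class AccessClass(Enum):
    """The three access classes named on the slide."""
    PUB = "public"
    ACA = "academic"
    RES = "restricted"


def may_access(resource_class, user_identified, has_individual_permission):
    """Hypothetical check: does a request meet the class requirements?

    PUB: no identification and no individual permission needed.
    ACA: identification needed, no individual permission.
    RES: identification and individual permission both needed.
    """
    if resource_class is AccessClass.PUB:
        return True
    if resource_class is AccessClass.ACA:
        return user_identified
    return user_identified and has_individual_permission
```

For example, `may_access(AccessClass.RES, True, False)` evaluates to `False`, because a restricted resource also requires individual permission.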
15. Technical Perspective (2)
§ Federated Identity Management (“Shibboleth”)
§ allows access to resources on a remote server
§ with institutional credentials
§ makes it relatively straightforward to recognize academic
users and grant them access to restricted resources
§ details: http://clarin.eu/node/3788
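As an illustration of how a service might recognize academic users from federated login data, here is a minimal sketch. It assumes the identity provider releases the standard eduPerson attribute `eduPersonScopedAffiliation` (values such as `member@example.org`); which attributes CLARIN centres actually rely on is not stated in these slides, and the function name and the set of accepted affiliations are hypothetical.

```python
# Affiliation values treated as "academic" here; this choice is an assumption.
ACADEMIC_AFFILIATIONS = {"faculty", "student", "staff", "employee", "member"}


def is_academic_user(attributes):
    """Return True if any scoped affiliation marks the user as academic.

    `attributes` is a dict mapping attribute names to lists of string
    values, as a service provider might hand them to an application.
    """
    for value in attributes.get("eduPersonScopedAffiliation", []):
        # A scoped affiliation looks like "<affiliation>@<scope>".
        affiliation = value.split("@", 1)[0].strip().lower()
        if affiliation in ACADEMIC_AFFILIATIONS:
            return True
    return False
```

A user whose only released value is `alum@example.org` would not be recognized as academic under this sketch, while `student@example.org` would.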
16. Future perspective for legal exception framework
§ Because CLARIN is capable of
§ identifying researchers and
§ protecting the resources from other users,
§ it already has all the technical prerequisites needed
for implementing and supervising a broad research
exception in the EU, such as the one already in effect in the
Netherlands.
17. Conclusion
§ Data mining plays an increasingly important role in
(language resource-based) research
§ Research infrastructures try to help academics make
efficient use of the existing resources and tools
§ Many technical issues have been addressed already
(e.g. authentication of researchers)
§ We hope the remaining legal (copyright) issues can be
addressed by a research exception (or, similarly, a concept
of fair use)
18. Acknowledgement
§ Thanks to Krister Lindén and Erik Ketzan from the
CLARIN legal issues committee for their valuable
input!
§ Thank you for your attention!