Towards TextPy, a module for processing text.
If we define annotated text as a graph with additional structure, we can make text processing more efficient, in the same way that Pandas makes processing dataframes more efficient.
1. Text as Data
TextPy ⊂ Text-Fabric
Dirk Roorda
2021-02-25
1 year after the Lorentz Workshop "Processing Ancient Text Corpora"
2. How to Analyze Data with Python, Pandas & Numpy - 10 Hour Course
• Lesson 1: Python & Jupyter Fundamentals
• Lesson 2: Numpy for data processing
• Lesson 3: Pandas for working with tabular data
• Lesson 4: Visualization with Matplotlib and Seaborn
• Lesson 5: Exploratory Data Analysis: A Case Study
• Course Project - Exploratory Data Analysis
• Find a real-world dataset of your choice online
• Use Numpy & Pandas to parse, clean & analyze data
• Use Matplotlib & Seaborn to create visualizations
• Ask and answer interesting questions about the data
codecamp
3. How to Analyze Text with Python, TextPy & Text-Fabric - 10 Hour Course
• Lesson 1: Python & Jupyter Fundamentals
• Lesson 2: TextPy for text processing
• Lesson 3: Text-Fabric for working with annotated corpora
• Lesson 4: Visualization with Matplotlib and Seaborn
• Lesson 5: Exploratory Data Analysis: A Case Study
• Course Project - Exploratory Data Analysis
• Find a real-world corpus of your choice online
• Use Walker to convert data
• Use TextPy for quantitative analysis
• Use Text-Fabric to query text and find interesting pieces
• Use Matplotlib & Seaborn to create visualizations
tf-docs
4. What to expect
TextPy is not smart
• no linguistic knowledge
• no AI
• not an annotation tool
• not a citation finder / parallel passage detector
• not a crowdsourcing application
TextPy works with a text-oriented data structure
• positions in a sequence
• embedding and overlap
• linking and connecting
• annotations
• efficient operations on this data structure
textpy
5. Example: NumPy vs OpenCV
• Image of Arabic text: open it with OpenCV
• Under the hood it is a NumPy 2-dimensional array of pixels
• Produce histograms and line boundaries by algorithms expressed in NumPy
• Show the results in the image with OpenCV
fusus
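The NumPy side of this example can be sketched as follows. This is a minimal illustration, not the actual fusus code: it assumes the page is already a 2-dimensional array in which ink pixels are non-zero, and it uses a small synthetic array instead of an image loaded with OpenCV. The function name `line_boundaries` is hypothetical.

```python
import numpy as np


def line_boundaries(image, threshold=0):
    """Return (top, bottom) row ranges of the text lines in a page image.

    Assumption: `image` is a 2-D array where ink pixels are non-zero.
    `bottom` is exclusive, so image[top:bottom] is one line of text.
    """
    # Histogram of ink: number of ink pixels in every pixel row
    ink_per_row = (image > 0).sum(axis=1)
    has_ink = ink_per_row > threshold

    # Pad with False so that every band of ink has a start and an end
    padded = np.concatenate(([False], has_ink, [False]))
    changes = np.diff(padded.astype(np.int8))

    starts = np.flatnonzero(changes == 1)   # blank row -> ink row
    ends = np.flatnonzero(changes == -1)    # ink row -> blank row
    return list(zip(starts.tolist(), ends.tolist()))


# Synthetic "page": two bands of ink on an empty background
page = np.zeros((12, 20), dtype=np.uint8)
page[2:4, 3:17] = 255
page[7:10, 5:15] = 255

print(line_boundaries(page))  # [(2, 4), (7, 10)]
```

In a real pipeline the array would come from an image library and the boundaries would be drawn back onto the image; the computation in between stays plain array arithmetic.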
6. Generous Python Modules
Generous, because they do so much work in so many situations
Basic models: set, list, tree, dictionary:
• standard library of the Python language
• flimsy operations
• ubiquitous use
Generic models: n-dim array, dataframe, RDF
• utility Python modules
• hard work inside the model
• usable wherever the domain can be expressed in the model
Specific models: HTML, PDF, TEI, NLTK
• domain specific Python modules
• substantial operations
• only usable for that domain
7. A generic model for text
A text is
• a graph (basic)
with
• the first N nodes ordered in a sequence (slots)
• all other nodes mapped to subsets of slots
• any number of mappings between nodes/edges and values (annotations)
tf-model
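The model above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Text-Fabric implementation: the class name `TextGraph`, its methods, and the feature names `text`, `otype` and `function` are all hypothetical, and nodes are assumed to be numbered from 1 with the slots first.

```python
class TextGraph:
    """Toy version of the model: slots, other nodes, and annotations."""

    def __init__(self, words):
        # The first N nodes (1..N) are the slots, in textual order
        self.n_slots = len(words)
        # Every non-slot node is mapped to a subset of the slots
        self.slots_of = {}
        # Annotations: feature name -> {node: value}
        self.features = {
            "text": {i + 1: word for i, word in enumerate(words)},
            "otype": {i + 1: "word" for i in range(len(words))},
        }
        # Edge annotations: feature name -> {(from_node, to_node): value}
        self.edge_features = {}
        self._next_node = self.n_slots + 1

    def add_node(self, otype, slots):
        """Add a non-slot node that occupies the given slots."""
        slots = frozenset(slots)
        if not slots or min(slots) < 1 or max(slots) > self.n_slots:
            raise ValueError("a node must occupy existing slots")
        node = self._next_node
        self._next_node += 1
        self.slots_of[node] = slots
        self.features["otype"][node] = otype
        return node

    def annotate(self, feature, node, value):
        """Attach a value to a node under a feature name."""
        self.features.setdefault(feature, {})[node] = value

    def slots(self, node):
        """A slot occupies itself; other nodes occupy their slot subset."""
        if node <= self.n_slots:
            return frozenset([node])
        return self.slots_of[node]


g = TextGraph(["In", "the", "beginning"])
phrase = g.add_node("phrase", [2, 3])
g.annotate("function", phrase, "time")
print(phrase, sorted(g.slots(phrase)))  # 4 [2, 3]
```

Because a node is just a set of slots, embedding and overlap need no extra structure: they follow from comparing slot sets.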
8. Supported operations
Micro
• high-speed walking through the textual sequence
• navigating between embedders and embeddees
• accessing feature values and weaving them into text
• displaying text structures
• querying on the combination of content and spatial relationships
Macro
• convert from arbitrary XML / TEI
• convert from arbitrary TSV
• compose / modify corpora
• export - process - re-import
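Two of the micro operations listed above, navigating between embedders and embeddees and weaving feature values into text, can be sketched on a tiny hand-made corpus. This is an assumption-laden illustration: the data, the function names, and the linear scans are hypothetical, and an efficient implementation would presumably use precomputed indexes instead.

```python
# Hypothetical corpus: nodes 1-4 are slots (words), 5-6 phrases, 7 a sentence
slots_of = {
    1: {1}, 2: {2}, 3: {3}, 4: {4},
    5: {1, 2},
    6: {3, 4},
    7: {1, 2, 3, 4},
}
text = {1: "the", 2: "king", 3: "went", 4: "home"}


def embedders(node):
    """Nodes whose slots contain all slots of `node`."""
    mine = slots_of[node]
    return sorted(m for m, s in slots_of.items() if m != node and mine <= s)


def embeddees(node):
    """Nodes whose slots are all contained in the slots of `node`."""
    mine = slots_of[node]
    return sorted(m for m, s in slots_of.items() if m != node and s <= mine)


def weave(node):
    """Join the text feature of the slots of `node`, in textual order."""
    return " ".join(text[slot] for slot in sorted(slots_of[node]))


print(embedders(2))  # [5, 7]
print(embeddees(5))  # [1, 2]
print(weave(6))      # went home
```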
9. To do
To make it happen
• Split Text-Fabric into the TextPy core and the Text-Fabric additions
• Optimize TextPy (Cythonize, indexing)
• Distribute "wheels" for Linux, macOS, Windows
• Support Pandas-ish text access: F.gender.v(n) becomes corpus.gender[n]
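One way the Pandas-ish access could sit on top of the existing call style is a thin wrapper. The sketch below is hypothetical throughout: `Feature` is a stand-in for an object with a `v(n)` method, and `FeatureView` and `Corpus` are invented names, not part of Text-Fabric or of any existing TextPy.

```python
class Feature:
    """Stand-in for a feature object that is read with v(n)."""

    def __init__(self, data):
        self._data = data

    def v(self, n):
        return self._data.get(n)


class FeatureView:
    """Makes view[n] return the same value as feature.v(n)."""

    def __init__(self, feature):
        self._feature = feature

    def __getitem__(self, n):
        return self._feature.v(n)


class Corpus:
    """Exposes every feature as an attribute that supports indexing."""

    def __init__(self, features):
        self._features = features

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return FeatureView(self._features[name])
        except KeyError:
            raise AttributeError(name) from None


F_gender = Feature({1: "m", 2: "f"})
corpus = Corpus({"gender": F_gender})

print(F_gender.v(2))     # f
print(corpus.gender[2])  # f
```

The wrapper adds no storage of its own; it only changes the surface syntax, so the old and new call styles can coexist during a transition.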
To build on it
• Add volume support: working per volume in a corpus
• Add operations that address multiple volumes
• Add operations that address multiple corpora
• intertextuality