1. Organization is Sharing: From eScience to Personal Information Management
Rodrigo Dias Arruda Senra
Advisor: Prof. Dr. Claudia Bauzer Medeiros
PhD Thesis Defense in Computer Science
Universidade Estadual de Campinas
Instituto de Computação
Campinas 2012-12-10
18. SciFrame
The Scientific Digital Data Processing Framework is a
conceptual framework that describes systems or
processes involving digital data manipulation.
33. Data Management
✓ enforce loose coupling between Apps and DBMS
✓ DBMS product/vendor independence
✓ seamless cross-database migration
✓ capability verification, validation and negotiation
✓ support Apps and DBMS in the cloud!
43. Architecture
[Figure: architecture diagram. Elements shown: Web; DMS X, DMS Y, DMS Z; Descriptor Registry (several instances); descriptor X, descriptor Y; Negotiator; App; binding.]
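The roles named in the diagram (Descriptor Registry, Negotiator, App, binding) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the `Descriptor` fields, the capability names, the `example.org` endpoints and the first-match negotiation policy are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical database descriptor: what a DMS declares it offers."""
    dms_name: str
    endpoint: str
    capabilities: frozenset = field(default_factory=frozenset)


class DescriptorRegistry:
    """In-memory stand-in for a Web-accessible descriptor registry."""

    def __init__(self):
        self._descriptors = {}

    def publish(self, descriptor):
        self._descriptors[descriptor.dms_name] = descriptor

    def all(self):
        return list(self._descriptors.values())


def negotiate(required, registries):
    """Return the first descriptor offering every required capability,
    or None when no published DMS qualifies (assumed policy)."""
    required = frozenset(required)
    for registry in registries:
        for descriptor in registry.all():
            if required <= descriptor.capabilities:
                return descriptor
    return None


def bind(app_name, descriptor):
    """Build a binding record the App could use to reach the chosen DMS."""
    return {"app": app_name,
            "dms": descriptor.dms_name,
            "endpoint": descriptor.endpoint}


registry = DescriptorRegistry()
registry.publish(Descriptor("DMS X", "https://example.org/x",
                            frozenset({"sql", "transactions"})))
registry.publish(Descriptor("DMS Y", "https://example.org/y",
                            frozenset({"key-value", "replication"})))

chosen = negotiate({"key-value"}, [registry])
binding = bind("App", chosen)
```

In this sketch the App never names a DMS product; it states requirements and receives a binding, which is one possible reading of the loose coupling the slides aim for.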
52. Problems
1. Single Category versus Multi-faceted Content
2. Manually-defined categories
3. Criteria are not explicit
4. Static Membership Relation
5. Organization is not reusable
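To make the listed problems concrete, here is a minimal Python sketch of the opposite situation: categories that carry an explicit criterion, allow a document to fall under several facets, and are recomputed on demand. The `Category` class, the document fields and the predicates are hypothetical illustrations, not the Organograph model itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Category:
    """A category whose membership test is stated explicitly."""
    name: str
    criterion: Callable[[Dict], bool]


def categorize(documents, categories):
    """Evaluate every criterion against every document.

    A document may land in several categories (multi-faceted), and
    membership is recomputed on each call rather than stored.
    """
    result = {category.name: [] for category in categories}
    for document in documents:
        for category in categories:
            if category.criterion(document):
                result[category.name].append(document["title"])
    return result


documents = [
    {"title": "gdal-notes.pdf", "type": "pdf", "year": 2012},
    {"title": "thesis.tex", "type": "tex", "year": 2012},
    {"title": "old-report.pdf", "type": "pdf", "year": 2009},
]
categories = [
    Category("PDF files", lambda d: d["type"] == "pdf"),
    Category("From 2012", lambda d: d["year"] == 2012),
]
groups = categorize(documents, categories)
```

Because the criteria are plain objects, the same `categories` list could be applied to another collection, which hints at how an organization might be reused.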
89. Related Work (SciFrame)
• CLRC scientific metadata model
B. Matthews and S. Sufi. The CLRC Scientific Metadata Model, version 1. DL TR 02001, CLRC, 2001.
• myGrid Information Model
Nick Sharman et al. "The myGrid information model." UK e-Science Programme All Hands Conference, 2004.
90. Related Work (DBDs)
Madnick and Wang. Evolution Towards Strategic Applications of Databases Through Composite Information Systems. Journal of Management Information Systems 5(2):5-22, 1988.
“In order to separate data from the application processing, it is necessary to employ a process descriptor and a database descriptor.
The process descriptor describes the name, the input/output data requirement, and other resource requirements of the processing components.
The database descriptor contains information about the data (e.g., data model, schema, access rights) in the database, similar to data dictionaries.
These two descriptors can be used by the execution environment to coordinate the interaction between the processing component and the database.”
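The two descriptors in the quotation can be sketched as plain data structures, with a small check standing in for the execution environment. This is an illustrative assumption only: the field names, the set-based schema and the rule list are hypothetical and are not defined by Madnick and Wang or by the thesis.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ProcessDescriptor:
    """Name, input/output data requirements, and one resource requirement."""
    name: str
    input_fields: FrozenSet[str]
    output_fields: FrozenSet[str]
    data_model: str


@dataclass(frozen=True)
class DatabaseDescriptor:
    """Data model, schema and access rights, as in a data dictionary."""
    data_model: str
    schema_fields: FrozenSet[str]
    access_rights: FrozenSet[str]  # e.g. {"read", "write"}


def coordination_issues(process: ProcessDescriptor,
                        database: DatabaseDescriptor) -> List[str]:
    """List the reasons the process cannot run against the database."""
    issues = []
    if process.data_model != database.data_model:
        issues.append("data model mismatch")
    missing = (process.input_fields | process.output_fields) - database.schema_fields
    if missing:
        issues.append("missing fields: " + ", ".join(sorted(missing)))
    if process.input_fields and "read" not in database.access_rights:
        issues.append("read access required")
    if process.output_fields and "write" not in database.access_rights:
        issues.append("write access required")
    return issues


report = ProcessDescriptor("yield-report",
                           input_fields=frozenset({"region", "ndvi"}),
                           output_fields=frozenset({"summary"}),
                           data_model="relational")
read_only_db = DatabaseDescriptor("relational",
                                  schema_fields=frozenset({"region", "ndvi"}),
                                  access_rights=frozenset({"read"}))
issues = coordination_issues(report, read_only_db)
```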
91. Related Work (Organographs)
• Topic Modeling
LSA, LDA, Hierarchical Bayesian
Blei 201; Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002; 2003; 2004; Hofmann, 1999; 2001
• Personal Information Management
CALO, UMEA, X-COSIM, Haystack, UpLib, Iris
Zimmermann 2005; Arndt 2007; Lansdale 1988; Kaptelinin 2003; Janssen & Popat 2003; Karger et al 2003
• Semantic Desktop
Nepomuk, SEMSOC
Giannakidou et al 2008; Groza et al 2007
• Personal Digital Libraries
Zotero, Mendeley, Papers
95. Publications
Submitted (JODS)
Evaluating, Reorganizing and Sharing Digital Information Hierarchies.
Rodrigo D. A. Senra, Claudia B. Medeiros.
Journal on Data Semantics (submitted 2012-10-25)
2011
Organographs - Multi-faceted Hierarchical Categorization of Web Documents.
Rodrigo D. A. Senra, Claudia B. Medeiros.
Proceedings of the 7th International Conference on Web Information Systems and Technologies - WEBIST: 583-588
2010
Database Descriptors: Laying the Path to Commodity Web Data Services.
Rodrigo D. A. Senra, Claudia B. Medeiros.
Proceedings of Engineering of Computer-Based Systems (ECBS): 386-392
2009
SciFrame: a conceptual framework to describe data sharing in eScience.
Rodrigo D. A. Senra, Claudia B. Medeiros.
Proceedings of the III Brazilian eScience workshop (XXIV SBBD)
2009
A standards-based framework to foster geospatial data and process interoperability.
Gilberto Z. Pastorello Jr., Rodrigo D. A. Senra, Claudia B. Medeiros.
Journal of the Brazilian Computer Society 15(1): 13-25
2008
Bridging the gap between geospatial resource providers and model developers.
Gilberto Z. Pastorello Jr., Rodrigo D. A. Senra, Claudia B. Medeiros.
Proceedings of the 16th International Conference on Advances in Geographic Information Systems - ACM SIGSPATIAL
2007
O projeto WebMAPS: desafios e resultados.
Carla G. N. Macário, Claudia B. Medeiros, Rodrigo D. A. Senra.
Proceedings of the 9th Brazilian Symposium on Geoinformatics - GeoInfo: 239-250
Topics: SciFrame, WebMaps, DBDs, Organographs
96. Extensions
SciFrame
Theoretical
• formalize design pattern
• enhance the operations vocabulary
Practical
• online catalog of eScience systems
• describe as ontology (RDF)
Database Descriptors
Theoretical
• analyse negotiation frameworks
• expand DBDs expressivity
• explore ranking algorithms
Practical
• catalog of concrete DBDs
• adapt Organicer to use DBDs
• experiment with dynamic negotiation
Organographs
Theoretical
• model with Category Theory
• explore DSLs to describe forg
Practical
• support non-textual media (e.g., images)
• expand component palette
97. Acknowledgments
• Laboratório de Sistemas de Informação (IC-Unicamp)
http://www.lis.ic.unicamp.br
• Brazilian Institute for Web Science Research
http://webscience.org.br
• Fapesp - CNPQ - CAPES
110. Transformation Workflow
[Figure: workflow diagram. Stage labels: Source Hierarchy; Hierarchy Navigation (Iterator); Extraction; Pre-processing; Facet Index; Transformation Workflow; Resulting Hierarchy; Visualization.]
Libraries shown, in slide order:
• NLTK
• BeautifulSoup, pyPdf
• pymongo
• networkx, gensim, numpy, scikit-learn
• matplotlib, ObsPy, InfoViz.js, D3.js
• os.walk, pydelicious, evernote
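The navigation and facet-index stages can be sketched with the standard library alone. The sketch below assumes a file-system source hierarchy walked with `os.walk` and uses two made-up facets (folder names and file extension); the function names, the facet choice and the in-memory dictionary (in place of pymongo) are hypothetical simplifications, not the thesis's implementation.

```python
import os
import tempfile
from collections import defaultdict


def iterate_hierarchy(root):
    """Yield (relative_path, facets) for every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        folders = [] if rel_dir == "." else rel_dir.split(os.sep)
        for filename in sorted(filenames):
            extension = os.path.splitext(filename)[1].lower() or "(none)"
            facets = {("folder", folder) for folder in folders}
            facets.add(("extension", extension))
            rel_path = os.path.join(rel_dir, filename) if folders else filename
            yield rel_path, facets


def build_facet_index(root):
    """Map each facet to the set of files that exhibit it."""
    index = defaultdict(set)
    for path, facets in iterate_hierarchy(root):
        for facet in facets:
            index[facet].add(path)
    return dict(index)


# Usage on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "papers", "2012"))
    for rel in ("notes.txt",
                os.path.join("papers", "a.pdf"),
                os.path.join("papers", "2012", "b.pdf")):
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("placeholder")
    index = build_facet_index(root)
```

Keeping the iterator separate from the index builder mirrors the diagram's split between navigation and indexing, so another source (for example a bookmarking service) could supply the same `(path, facets)` pairs.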
124. Related Work
Approaches to App-to-DMS binding
• embedded
• n-tier client/server (including web services)
• mediators
Descriptors are orthogonal to all of these!
Information Integration [1]
Process
• Understanding
• Standardization
• Specification
• Execution
Mechanism
• Materialization
• Federation
• Indexing
[1] Laura Haas. Beauty and the Beast: The Theory and Practice of Information Integration.
125. Sensor Data Extraction
from osgeo import gdal
from osgeo.gdalconst import GA_ReadOnly

# raster_file is the path to the raster (TIF) file
dataset = gdal.Open(raster_file, GA_ReadOnly)
# Get the coefficients of the affine coordinate-mapping functions
gt = dataset.GetGeoTransform()
# Get the data band of interest
band = dataset.GetRasterBand(1)
# Identify the data encoding.
# For the TIF file the data are unsigned bytes ('Byte')
data_type = gdal.GetDataTypeName(band.DataType)
# Get the image dimensions
width, height = band.XSize, band.YSize
# Convert the MBR from lat/long coordinates to row/column
#   Xgeo = GT(0) + Xpixel*GT(1) + Yline*GT(2)
#   Ygeo = GT(3) + Xpixel*GT(4) + Yline*GT(5)
# g2p is a helper not shown on this slide; ul_geo and lr_geo come from
# the geometry extraction step
ul_pixel, lr_pixel = g2p(gt, *ul_geo), g2p(gt, *lr_geo)
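The slides do not show `g2p`. A plausible reconstruction, offered only as an assumption, inverts the two affine equations quoted in the comments and returns a `(row, column)` pair, which is the order the later `ReadRaster` call expects. The sample geotransform values are made up.

```python
import math


def g2p(gt, x_geo, y_geo):
    """Map a geographic point to a (row, column) pixel index.

    Inverts
        Xgeo = gt[0] + Xpixel*gt[1] + Yline*gt[2]
        Ygeo = gt[3] + Xpixel*gt[4] + Yline*gt[5]
    by solving the 2x2 linear system for Xpixel and Yline.
    """
    det = gt[1] * gt[5] - gt[2] * gt[4]
    if det == 0:
        raise ValueError("geotransform is not invertible")
    dx, dy = x_geo - gt[0], y_geo - gt[3]
    x_pixel = (dx * gt[5] - dy * gt[2]) / det
    y_line = (dy * gt[1] - dx * gt[4]) / det
    return int(math.floor(y_line)), int(math.floor(x_pixel))


# Hypothetical north-up raster: origin at (-50, -20), 0.5-degree pixels
sample_gt = (-50.0, 0.5, 0.0, -20.0, 0.0, -0.5)
origin_pixel = g2p(sample_gt, -50.0, -20.0)
inner_pixel = g2p(sample_gt, -48.75, -21.25)
```

With a north-up image the rotation terms `gt[2]` and `gt[4]` are zero and the inversion reduces to two independent divisions; the general form is kept so rotated rasters are also handled.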
129. Data Extraction
import struct
import numpy

def raster2array(ul_pixel, lr_pixel, dtype='B'):
    """Return a numpy array with the region of interest read from the
    raster band, bounded by ul_pixel and lr_pixel (each a (row, column) pair).
    """
    col_size = lr_pixel[1] - ul_pixel[1] + 1
    row_size = lr_pixel[0] - ul_pixel[0] + 1
    # ReadRaster takes (x offset, y offset, x size, y size)
    scanline = band.ReadRaster(ul_pixel[1], ul_pixel[0],
                               col_size, row_size)
    num_pixels = col_size * row_size
    roi = numpy.array(struct.unpack(dtype * num_pixels, scanline))
    roi.shape = (row_size, col_size)
    return roi

# Read data from the raster file into a numpy array
# holding the region-of-interest matrix
roi = raster2array(ul_pixel, lr_pixel)
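The unpack-and-reshape step can be tried without a raster file. The sketch below is a standard-library illustration with a synthetic byte buffer standing in for the `ReadRaster` result; `bytes_to_rows` is a hypothetical helper name, and nested lists replace the numpy array.

```python
import struct


def bytes_to_rows(buffer, row_size, col_size, fmt="B"):
    """Unpack a flat, row-major byte buffer into row_size lists of
    col_size values, using one struct format code per pixel."""
    values = struct.unpack(fmt * (row_size * col_size), buffer)
    return [list(values[r * col_size:(r + 1) * col_size])
            for r in range(row_size)]


# Synthetic 2x3 buffer of unsigned bytes standing in for ReadRaster output
scanline = bytes([10, 20, 30, 40, 50, 60])
rows = bytes_to_rows(scanline, row_size=2, col_size=3)
```

The format string `'B'` matches the unsigned-byte encoding reported for the TIF band; other band types would need a different format code.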
130. Geometry Extraction
from osgeo import ogr

shp = ogr.Open(filepath)
# Layer corresponding to the State of São Paulo
layer = shp.GetLayerByName('35mu500gc')
# Feature corresponding to the municipality of Campinas
feature = layer.GetFeature(501)
# Extract the control points of the perimeter
geometry = feature.GetGeometryRef()
poly = geometry.GetGeometryRef(0)
centroid = geometry.Centroid()
centroid_geo = centroid.GetX(), centroid.GetY()
# Define the Minimum Bounding Rectangle (MBR)
lg_left, lg_right, lt_bot, lt_up = poly.GetEnvelope()
ul_geo, lr_geo = (lg_left, lt_up), (lg_right, lt_bot)
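The MBR step can be illustrated in plain Python. The sketch assumes the perimeter is available as a list of `(longitude, latitude)` pairs and returns the envelope in the same `(min x, max x, min y, max y)` order that `GetEnvelope` uses; the helper names and the coordinates are made up and are not the real Campinas boundary.

```python
def envelope(points):
    """Return (min_x, max_x, min_y, max_y) for a list of (x, y) points."""
    xs = [x for x, _y in points]
    ys = [y for _x, y in points]
    return min(xs), max(xs), min(ys), max(ys)


def corners(env):
    """Return the upper-left and lower-right corners of an envelope."""
    min_x, max_x, min_y, max_y = env
    return (min_x, max_y), (max_x, min_y)


# Hypothetical perimeter vertices as (longitude, latitude)
perimeter = [(-47.2, -22.9), (-46.8, -23.1), (-46.9, -22.7), (-47.1, -22.8)]
mbr = envelope(perimeter)
ul_corner, lr_corner = corners(mbr)
```

The upper-left corner pairs the smallest longitude with the largest latitude, which is why the slide code builds `ul_geo` from `lg_left` and `lt_up`.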