Algorithms for Table Structure Recognition

Yosveni Escalona http://orcid.org/0000-0003-2992-0540

Abstract

Tables are a very common way to organize and publish data. The Web, for example, holds an enormous number of tables that are published in HTML, embedded in PDF documents, or simply available for download from Web pages. However, tables are not always easy to interpret, because they vary widely in their features and are organized in different formats. As a result, a large number of methods and tools have been developed for table interpretation. This work presents the implementation of an algorithm, based on Conditional Random Fields (CRF), that classifies each row of a table as a header row, a data row, or a metadata row. The implementation is complemented by two algorithms for recognizing tables in spreadsheets, one based on rules and the other on region detection. Finally, the work describes the results and benefits obtained by applying the algorithm to HTML tables obtained from the Web and to spreadsheet tables downloaded from the website of the Brazilian National Petroleum Agency.
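To make the row-classification idea concrete, the following is a minimal sketch of how a linear-chain model can assign the labels header, data, and metadata to a sequence of rows using Viterbi decoding. It is not the article's implementation: the feature set (`numeric_frac`, `filled_frac`, `bias`), the weight tables `EMIT` and `TRANS`, and the function names are all assumptions chosen for illustration, and the weights are set by hand, whereas a real CRF would learn them from labeled tables.

```python
from typing import Dict, List, Tuple

LABELS = ["header", "data", "metadata"]

# Hypothetical, hand-set weights. A trained CRF would estimate these from data.
EMIT: Dict[str, Dict[str, float]] = {
    "header":   {"numeric_frac": -2.0, "filled_frac": 2.0,  "bias": 0.0},
    "data":     {"numeric_frac": 3.0,  "filled_frac": 1.0,  "bias": 0.0},
    "metadata": {"numeric_frac": -1.0, "filled_frac": -2.0, "bias": 2.0},
}
# Score for moving from one row label to the next; unlisted pairs score 0.
TRANS: Dict[Tuple[str, str], float] = {
    ("metadata", "header"): 0.5,
    ("header", "data"): 1.0,
    ("data", "data"): 0.5,
    ("data", "header"): -2.0,
    ("data", "metadata"): 0.5,
}


def is_number(cell: str) -> bool:
    """Return True if the cell text parses as a number."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def row_features(row: List[str]) -> Dict[str, float]:
    """Compute a few simple, assumed features for one table row."""
    cells = [c.strip() for c in row]
    filled = [c for c in cells if c]
    n = max(len(cells), 1)
    return {
        "numeric_frac": sum(is_number(c) for c in filled) / n,
        "filled_frac": len(filled) / n,
        "bias": 1.0,
    }


def label_rows(rows: List[List[str]]) -> List[str]:
    """Viterbi decoding: find the highest-scoring label sequence for the rows."""
    if not rows:
        return []
    # Per-row score of each label, from the weighted features.
    emis = [
        {y: sum(EMIT[y][k] * v for k, v in row_features(r).items()) for y in LABELS}
        for r in rows
    ]
    best = [dict(emis[0])]                 # best[t][y]: best score ending in y at row t
    back: List[Dict[str, str]] = []        # back[t-1][y]: best previous label for y at row t
    for t in range(1, len(rows)):
        scores: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANS.get((p, y), 0.0))
            scores[y] = best[-1][prev] + TRANS.get((prev, y), 0.0) + emis[t][y]
            ptr[y] = prev
        best.append(scores)
        back.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With a small made-up table such as a title row, a row of column names, two rows of mostly numeric values, and a trailing source note, `label_rows` returns one label per row. The transition scores are what distinguish this from classifying each row independently: they encode, for instance, that a header row is unlikely to follow a data row.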
