SIGMORPHON 2023 and AmericasNLP 2023

Codes and Datasets

Poster for AmericasNLP

Colex2Lang: Language Embeddings from Semantic Typology

In semantic typology, colexification refers to words with multiple meanings, either related (polysemy) or unrelated (homophony). Studies of cross-linguistic colexification have yielded insights in fields such as psychology, historical linguistics and cognitive science (Xu et al., 2020; Brochhagen and Boleda, 2022; Schapper and Koptjevskaja-Tamm, 2022). While NLP research has so far mainly focused on integrating syntactic typology (Naseem et al., 2012; Ponti et al., 2019; Chaudhary et al., 2019; Üstün et al., 2020; Ansell et al., 2021; Oncevay et al., 2022), we investigate the potential of incorporating semantic typology, of which colexification is an example. We propose a framework for constructing a large-scale synset graph and learning language representations with node embedding algorithms. We demonstrate that cross-lingual colexification patterns provide a distinct signal for modelling language similarity and predicting typological features. Our representations achieve a 9.97% performance gain in predicting lexico-semantic typological features and, as expected, contain a weaker syntactic signal. This study is the first attempt to learn language representations and model language similarities using semantic typology at a large scale, setting a new direction for multilingual NLP, especially for low-resource languages.
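
To make the idea of a colexification signal concrete, here is a minimal sketch. It does not reproduce the synset graph or the node embedding algorithms described above: it assumes a tiny hypothetical lexicon (`LEXICON`, with invented language names and word forms) and swaps in a plain Jaccard overlap of colexified concept pairs as a stand-in for similarity between learned language embeddings.

```python
from collections import defaultdict
from itertools import combinations

# Toy lexicon (hypothetical data): language -> word form -> concepts it expresses.
LEXICON = {
    "lang_a": {"w1": {"TREE", "WOOD"}, "w2": {"HAND", "ARM"}},
    "lang_b": {"v1": {"TREE", "WOOD"}, "v2": {"MOON", "MONTH"}},
    "lang_c": {"u1": {"MOON", "MONTH"}, "u2": {"HAND", "ARM"}, "u3": {"TREE", "WOOD"}},
}


def colexification_edges(lexicon):
    """Map each colexified concept pair to the set of languages that colexify it."""
    edges = defaultdict(set)
    for lang, words in lexicon.items():
        for concepts in words.values():
            # Every pair of concepts sharing one word form is one colexification.
            for a, b in combinations(sorted(concepts), 2):
                edges[(a, b)].add(lang)
    return dict(edges)


def language_similarity(edges, lang_x, lang_y):
    """Jaccard overlap of the colexification patterns of two languages."""
    x = {pair for pair, langs in edges.items() if lang_x in langs}
    y = {pair for pair, langs in edges.items() if lang_y in langs}
    union = x | y
    return len(x & y) / len(union) if union else 0.0


edges = colexification_edges(LEXICON)
sim_ac = language_similarity(edges, "lang_a", "lang_c")
sim_ab = language_similarity(edges, "lang_a", "lang_b")
```

In this toy setting `lang_a` shares two of three colexification patterns with `lang_c` but only one of three with `lang_b`, so the first pair comes out as more similar. A graph of this kind, with concepts as nodes and colexifications as edges, is the sort of structure a node embedding method could then be trained on.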

Codes and Language Embeddings

MigrationsKB (MGKB) is a public knowledge base of anonymized, annotated migration-related tweets. MGKB currently contains over 200 thousand tweets covering the period from January 2013 to July 2021, restricted to 11 European countries: the United Kingdom, Germany, Spain, Poland, France, Sweden, Austria, Hungary, Switzerland, the Netherlands and Italy. Metadata about the tweets, such as geo information (place name, coordinates, country code), is included. MGKB contains entities, sentiments, hate speech, topics, hashtags and encrypted user mentions in RDF format. The schema of MGKB extends TweetsKB with migration-related information. Moreover, the FIBO ontology is used to associate and represent the potential economic and social factors driving migration flows, drawn from sources such as Eurostat and Statista. The extracted economic indicators, such as GDP growth rate, are connected with each tweet in RDF using geographical and temporal dimensions. The user IDs and the tweet texts are encrypted for privacy purposes, while the tweet IDs are preserved.
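
The link between tweets and economic indicators along geographical and temporal dimensions can be pictured as a join on country and year. The sketch below is only an illustration in plain Python, not the RDF schema of MGKB: the field names (`tweet_id`, `country_code`, `created`), the helper `link_indicators` and all values are hypothetical placeholders.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative, not the MGKB schema.
tweets = [
    {"tweet_id": "t1", "country_code": "DE", "created": date(2015, 9, 4)},
    {"tweet_id": "t2", "country_code": "FR", "created": date(2016, 3, 1)},
    {"tweet_id": "t3", "country_code": "DE", "created": date(2020, 1, 15)},
]

# Indicator values keyed by (country code, year); the numbers are placeholders.
gdp_growth = {("DE", 2015): 1.5, ("FR", 2016): 1.1}


def link_indicators(tweets, indicator, name):
    """Attach an indicator value to each tweet via its country and year."""
    linked = []
    for tweet in tweets:
        key = (tweet["country_code"], tweet["created"].year)
        # .get returns None when no indicator value is known for that key.
        linked.append({**tweet, name: indicator.get(key)})
    return linked


linked = link_indicators(tweets, gdp_growth, "gdp_growth_rate")
```

Tweets without a matching indicator keep a `None` value rather than being dropped, which keeps the tweet collection intact when an indicator series has gaps.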

Webpage Code DOI

Multilingual MigrationsKB

Multilingual MigrationsKB (MGKB) is a multilingual extension of the English MGKB. Tweets geotagged in 32 European countries (Austria, Belgium, Bulgaria, Croatia, Cyprus, the Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, the Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden, Iceland, Liechtenstein, Norway, Switzerland, the United Kingdom) are extracted and filtered by 11 languages (English, French, Finnish, German, Greek, Dutch, Hungarian, Italian, Polish, Spanish, Swedish). Metadata about the tweets, such as geo information (place name, coordinates, country code), is included.

Code DOI

This poster focuses on capturing the temporal evolution of migration-related topics in relevant tweets. It uses the Dynamic Embedded Topic Model (DETM), which takes temporality into account, to perform a quantitative and qualitative analysis of these emerging topics. TweetsKB is extended with the extracted Twitter dataset and the DETM results, which are then further analyzed and visualized. The analysis reveals that the trajectories of the migration-related topics agree with historical events.


Interpretable Change Point Detection in Review Streams

Change point detection (CPD) has become increasingly important in time series analysis, and many well-functioning algorithms exist. However, little attention has been paid to evaluating the performance of change point detection algorithms on real-world time series, let alone real-world natural language time series. In this thesis, we create a dataset from real-world hotel reviews and use it as ground truth to evaluate various CPD algorithms. To the best of our knowledge, this is the first dataset specifically designed to evaluate change point detection algorithms in natural language processing, and it potentially provides a realistic benchmark in this area.
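
As a minimal illustration of what a change point detector does, the sketch below finds a single change in the mean of a numeric series by exhaustive search. It is not one of the algorithms evaluated in the thesis. It assumes the review stream has already been reduced to one number per time step, and the `ratings` values are invented for the example.

```python
def sse(values):
    """Sum of squared deviations from the mean of a segment."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_change_point(series, min_size=2):
    """Return the split index that minimises the within-segment squared error.

    The returned index is the first position of the second segment,
    or None if the series is too short to split.
    """
    best_index, best_cost = None, float("inf")
    for i in range(min_size, len(series) - min_size + 1):
        cost = sse(series[:i]) + sse(series[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


# Hypothetical weekly average review ratings with a drop after the sixth week.
ratings = [4.5, 4.4, 4.6, 4.5, 4.3, 4.5, 3.1, 3.0, 3.2, 2.9, 3.1]
change = best_change_point(ratings)
```

Here `change` is 6, the index of the first week at the lower rating level. Evaluating a detector against ground truth then amounts to comparing such predicted indices with annotated ones.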

AnnotationInterface codes masterthesis

Super Resolution Object Detection

Tiny Object Detection in Deep Learning.

poster codes

Combined Distributional and Formal Semantics

Seminar For Semantik II, Vecchi, CIS, LMU 2019

Presentation codes

Argumentation in Dialogues

Addresses the following questions:

  • How can we extract argument segments in dialogues that clearly express a particular argument facet (such as morality or the Second Amendment)?
  • How can we recognize that two argument segments are semantically similar, i.e., about the same facet of the argument?

Presentation codes

Deep Learning for Extraction of Opinion Entities

Sequence Labeling, BiLSTM(+CRF), CNN(+CRF)
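
Sequence labeling models of this kind emit one tag per token, and the opinion entities are read off the tag sequence afterwards. The sketch below shows that decoding step only. It assumes a BIO tagging scheme, and the label names (`HOLDER`, `EXPRESSION`, `TARGET`) and the example sentence are hypothetical, since the project description does not specify them.

```python
def decode_bio(tokens, tags):
    """Collect (label, text) spans from BIO tags aligned with tokens."""
    spans, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new span starts; an I- tag with a different label is treated as a start.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            # Continuation of the currently open span.
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical tagger output for one sentence.
tokens = ["The", "committee", "criticized", "the", "new", "policy", "."]
tags = ["B-HOLDER", "I-HOLDER", "B-EXPRESSION", "B-TARGET", "I-TARGET", "I-TARGET", "O"]
spans = decode_bio(tokens, tags)
```

For this input, `spans` holds the holder "The committee", the expression "criticized" and the target "the new policy". A CRF layer on top of a BiLSTM or CNN mainly helps keep such tag sequences consistent, for example by discouraging an `I-` tag that does not follow a matching `B-` tag.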

codes disputation