Publications

Current publications of the Chair of Algorithms and Data Structures

Years: 2024 | 2023 | 2022 | 2021 | 2020 | 2019 | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 | 2009

2024

back to the year overview

Bast H, Brosi P
Large-Scale Generation of Transit Maps from OpenStreetMap Data
2024 The Cartographic Journal

Abstract We investigate the automatic generation of transit map overlays (either geographically correct or schematic) for the entire planet from OpenStreetMap (OSM) data. To achieve this, we first extract relevant transit line geometries (together with their line name, and, if present, colour) from OSM using SPARQL queries against an RDF version of the OSM data. The queries are run against our SPARQL engine QLever. In the second step, we build a global line network graph where every edge is labelled with the lines travelling through it. The components of this network graph are then rendered as a transit map using tools and methods developed in previous work. Final maps are delivered as vector tiles to an interactive web map, from which individual network graphs (in a GeoJSON format proposed in this work) can be downloaded for research purposes. The vector tiles are also freely available. We briefly describe the methods used in each pipeline step, evaluate the speed and quality of our approach and discuss possible shortcomings.

2023

back to the year overview

Bast H, Brosi P, Kalmbach J, Lehmann A
Efficient Interactive Visualization of Very Large Geospatial Query Results
2023 SIGSPATIAL

Abstract We present a web mapping application that offers interactive visualization of query results with hundreds of millions of geospatial objects. This is in contrast to existing applications, which are slow or unresponsive when the number of objects in the result is large. We describe a general technique, which works for any database engine that represents each geospatial object with a unique IDs and that can return a query result either with the objects or with the IDs. We have implemented a web mapping application using this technique and with the QLever SPARQL engine as backend. We evaluate it on queries on the complete OpenStreetMap (OSM) data, with result sizes ranging from small to very large. We compare it against the map interfaces of Overpass, PostGIS, and OSCAR
Bast H, Hertel M, Prange N
ELEVANT: A Fully Automatic Fine-Grained Entity Linking Evaluation and Analysis Tool
2023

2022

back to the year overview

Brosi P
A Toolchain for Generating Transit Maps from Schedule Data
2022 Schematic Mapping Workshop 2022

Abstract We give a quick overview over LOOM, a software toolchain which aims to render transit maps from raw schedule data fully automatically. The maps can either be geographically correct or schematic (using e.g. octilinear, orthoradial, or hexalinear layouts). To ensure extensibility, the toolchain is implemented as a Unix pipeline, with individual tools performing well-defined subtasks (e.g. extracting the network graph from schedule data, avoiding overlapping segments, finding a schematic layout, finding the optimal line ordering on the network segments, and rendering the map as an SVG file). An auxiliary tool to add missing geographical line courses to schedule data (pfaedle) is also presented.
Bast H, Kalmbach J, Klumpp T, Kramer F, Schnelle N
Efficient and Effective SPARQL Autocompletion on Very Large Knowledge Graphs
Materials
2022

Abstract We show how to achieve fast autocompletion for SPARQL queries on very large knowledge graphs. At any position in the body of a SPARQL query, the autocompletion suggests matching subjects, predicates, or objects. The suggestions are context-sensitive and ranked by their relevance to the part of the query already typed. The suggestions can be narrowed down by prefix search on the names and aliases of the desired subject, predicate, or object. All suggestions are themselves obtained via SPARQL queries. For existing SPARQL engines, these queries are impractically slow on large knowledge graphs. We present various algorithmic and engineering improvements of an open-source SPARQL engine such that these queries are executed efficiently. We evaluate a variety of suggestion methods on three large knowledge graphs, including the complete Wikidata (9.9B triples). We compare our results with two widely used SPARQL engines, Virtuoso and Blazegraph. Our code, benchmarks, and complete reproducibility materials are available on https://ad.cs.uni-freiburg.de/publications .

2021

back to the year overview

Bast H, Brosi P, Kalmbach J, Lehmann A
An Efficient RDF Converter and SPARQL Endpoint for the Complete OpenStreetMap Data
Materials
2021 SIGSPATIAL

Abstract We present osm2rdf, a tool for converting OpenStreetMap (OSM) data to RDF triples, along with an efficient SPARQL endpoint and a convenient user interface for formulating SPARQL queries on that data. Unlike previous tools, osm2rdf retains all data provided by OSM, including the complete object geometries. Optionally, the tool can output explicit triples realizing the spatial relations contains and intersects. We provide weekly updates of the data (for the whole planet and also per continent and per country) on https://osm2rdf.cs.uni-freiburg.de. The tool is publicly available on GitHub. The SPARQL endpoint is realized via the open-source SPARQL engine QLever. We extended QLever to enable the efficient geometric filtering of a result by a given axis-parallel rectangle. The QLever UI provides interactive context-sensitive autocompletion that helps constructing SPARQL queries without prior knowledge of the details of the data.
Bast H, Hertel M, Mostafa M. Mohamed
Tokenization Repair in the Presence of Spelling Errors
Materials
2021 CoNLL

Abstract We consider the following tokenization repair problem: Given a natural language text with any combination of missing or spurious spaces, correct these. Spelling errors can be present, but it’s not part of the problem to correct them. For example, given: “Tispa per isabout token izaionrep air”, compute “Tis paper is about tokenizaion repair”. We identify three key ingredients of highquality tokenization repair, all missing from previous work: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present. Our methods also improve existing spell checkers by fixing not only more tokenization errors but also more spelling errors: once it is clear which characters form a word, it is much easier for them to figure out the correct word. We provide six benchmarks that cover three use cases (OCR errors, text extraction from PDF, human errors) and the cases of partially correct space information and all spaces missing. We evaluate our methods against the best existing methods and a non-trivial baseline. We provide full reproducibility under https://ad.cs.uni-freiburg.de/publications .
Bast H, Brosi P, Storandt S
Metro Maps on Flexible Base Grids
Materials
2021 17th International Symposium on Spatial and Temporal Databases (SSTD '21)

Abstract We present new generic methods to efficiently draw schematized metro maps for a wide variety of layouts, including octilinear, hexalinear, and orthoradial maps. The maps are drawn by mapping the input graph to a suitable grid graph. Previous work was restricted to regular octilinear grids. In this work, we investigate a variety of grids, including triangular grids and orthoradial grids. In particular, we also construct sparse grids where the local node density adapts to the input graph (e.g. octilinear Hanan grids, which we introduce in this work). For octilinear maps, this reduces the grid size by a factor of up to 5 compared to previous work, while still achieving close-to-optimal layouts. For many maps, this reduction also leads to up to 5 times faster solution times of the underlying optimization problem. We evaluate our approach on five maps. All octilinear maps can be computed in under 0.5 seconds, all hexalinear and orthoradial maps can be computed in under 2.5 seconds.
Bast H, Kalmbach J, Klumpp T, Kramer F, Schnelle N
Efficient SPARQL Autocompletion via SPARQL
Materials
2021 arXiv

Abstract We show how to achieve fast autocompletion for SPARQL queries on very large knowledge bases. At any position in the body of a SPARQL query, the autocompletion suggests matching subjects, predicates, or objects. The suggestions are context-sensitive in the sense that they lead to a non-empty result and are ranked by their relevance to the part of the query already typed. The suggestions can be narrowed down by prefix search on the names and aliases of the desired subject, predicate, or object. All suggestions are themselves obtained via SPARQL queries, which we call \emph{autocompletion queries}. For existing SPARQL engines, these queries are impractically slow on large knowledge bases. We present various algorithmic and engineering improvements of an existing SPARQL engine such that these autocompletion queries are executed efficiently. We provide an extensive evaluation of a variety of suggestion methods on three large knowledge bases, including Wikidata (6.9B triples). We explore the trade-off between the relevance of the suggestions and the processing time of the autocompletion queries. We compare our results with two widely used SPARQL engines, Virtuoso and Blazegraph. On Wikidata, we achieve fully sensitive suggestions with sub-second response times for over $90\%$ of a large and diverse set of thousands of autocompletion queries.

2020

back to the year overview

Bast H, Brosi P, Näther M
Similarity Classification of Public Transit Stations
2020 arXiv

Abstract We study the following problem: given two public transit station identifiers and , each with a label and a geographic coordinate, decide whether and describe the same station. For example, for "St Pancras International" at (51.5306, −0.1253) and "London St Pancras" at (51.5319, −0.1269), the answer would be "Yes". This problem frequently arises in areas where public transit data is used, for example in geographic information systems, schedule merging, route planning, or map matching. We consider several baseline methods based on geographic distance and simple string similarity measures.We also experiment with more elaborate string similarity measures and manually created normalization rules. Our experiments show that these baseline methods produce good, but not fully satisfactory results. We therefore develop an approach based on a random forest classifier which is trained on matching trigrams between two stations, their distance, and their position on an interwoven grid. All approaches are evaluated on extensive ground truth datasets we generated from OpenStreetMap (OSM) data: (1) The union of Great Britain and Ireland and (2) the union of Germany, Switzerland, and Austria. On all datasets, our learning based approach achieves an F1 score of over 99%, while even the most elaborate baseline approach (based on TFIDF scores and the geographic distance) achieves an F1 score of at most 94%, and a naive approach of using a geographical distance threshold achieves an F1 score of only 75%. Both our training and testing datasets are publicly available1.
Bast H, Brosi P, Näther M
staty: Quality Assurance for Public Transit Stations in OpenStreetMap
2020 Sigspatial

Abstract We present staty, a browser-based tool for quality assurance of public transit station tagging in OpenStreetMap (OSM). Building on the results of a similarity classifier for these stations, our tool visualizes name tag errors as well as incorrect and/or missing station group relations. Detailed edit suggestions are provided for individual objects. This is done intrinsically without an external ground truth. Instead, the underlying classifier is trained on the OSM data itself. We describe how our tool derives errors and suggestions from station tag similarities and provide experimental results on the OSM data of the United Kingdom, the United States, and a dataset consisting of Germany, Switzerland, and Austria. Our tool can be accessed under https://staty.cs.uni-freiburg.de.
Bast H, Hertel M, Mostafa M
Tokenization Repair in the Presence of Spelling Errors

2020 arXiv

Abstract We consider the following tokenization repair problem: Given a natural language text with any combination of missing or spurious spaces,correct these. Spelling errors can be present,but it’s not part of the problem to correct them. For example, given: “Tispa per is about tokenizaion repair”, compute “Tis paper is about tokenizaion repair”.It is tempting to think of this problem as a special case of spelling correction or to treat the two problems together. We make a case that tokenization repair and spelling correction should and can be treated as separate problems.We investigate a variety of neural models as well as a number of strong baselines. We identify three main ingredients to high-quality tokenization repair: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present.Our best methods can repair all tokenization errors on 97.5% of the correctly spelled test sentences and on 96.0% of the misspelled test sentences. With all spaces removed from the given text (the scenario from previous work),the accuracy falls to 94.5% and 90.1%, respectively. We conduct a detailed error analysis.
Bast H, Brosi P, Storandt S
Metro Maps on Octilinear Grid Graphs
2020 EuroVis Eurographics Conference on Visualization, volume: 39, issue: 3

Abstract Schematic transit maps (often called "metro maps" in the literature) are important to produce comprehensible visualizations ofcomplex public transit networks. In this work, we investigate the problem of automatically drawing such maps on an octilineargrid with an arbitrary (but optimal) number of edge bends. Our approach can naturally deal with obstacles that should berespected in the final drawing (points of interest, rivers, coastlines) and can prefer grid edges near the real-world course of aline. This allows our drawings to be combined with existing maps, for example as overlays in map services. We formulate aninteger linear program which can be used to solve the problem exactly. We also provide a fast approximation algorithm whichgreedily calculates shortest paths between node candidates on the underlying octilinear grid graph. Previous work used localsearch techniques to update node positions until a local optimum was found, but without guaranteeing octilinearity. We canthus calculate nearly optimal metro maps in a fraction of a second even for complex networks, enabling the interactive use ofour method in map editors

2019

back to the year overview

Bast H, Brosi P, Storandt S
Efficient Generation of Geographically Accurate Transit Maps
2019 ACM

Abstract We present LOOM (Line-Ordering OptimizedMaps), an automatic generator of geographically accurate transit maps. The input to LOOM is data about the lines of a transit network: for each line, its station sequence and geographical course. LOOM proceeds in three stages: (1) construct a line graph, where edges correspond to network segments with the same set of lines following the same course; (2) apply a set of local transformation rules that compute an optimal partial ordering of the lines and speed up the next stage; (3) construct an Integer Linear Program (ILP) that yields a line ordering for each edge and minimizes the total number of line crossings and line separations; and (4) based on the line graph and the computed line ordering, draw the map. As our maps respect the geography of the transit network, they can be used as overlays in typical map services. Previous research either did not take the network geography into account or was only concerned with schematic metro map layouting. We evaluate LOOM on six real-world transit networks, with line-ordering search-space sizes up to 2 × 10267. Using our transformation rules and an improved ILP formulation, we compute optimal line orderings in a fraction of a second for all networks. This enables interactive use of our method in map editors.

2018

back to the year overview

Bast H, Brosi P, Storandt S
Efficient Generation of Geographically Accurate Transit Maps
2018 SIGSPATIAL

Abstract We present LOOM (Line-Ordering Optimized Maps), a fully automatic generator of geographically accurate transit maps. The input to LOOM is data about the lines of a given transit network, namely for each line, the sequence of stations it serves and the geographical course the vehicles of this line take. We parse this data from GTFS, the prevailing standard for public transit data. LOOM proceeds in three stages: (1) construct a so-called line graph, where edges correspond to segments of the network with the same set of lines following the same course; (2) construct an ILP that yields a line ordering for each edge which minimizes the total number of line crossings and line separations; (3) based on the line graph and the ILP solution, draw the map. As a naive ILP formulation is too demanding, we derive a new custom-tailored formulation which requires significantly fewer constraints. Furthermore, we present engineering techniques which use structural properties of the line graph to further reduce the ILP size. For the subway network of New York, we can reduce the number of constraints from 229,000 in the naive ILP formulation to about 3,700 with our techniques, enabling solution times of less than a second. Since our maps respect the geography of the transit network, they can be used for tiles and overlays in typical map services. Previous research work either did not take the geographical course of the lines into account, or was concerned with schematic maps without optimizing line crossings or line separations.
Bast H, Brosi P
Sparse Map-Matching in Public Transit Networks with Turn Restrictions
2018 SIGSPATIAL

Abstract We investigate the following map-matching problem: given a sequence of stations taken by a public transit vehicle and given the underlying network, find the most likely geographical course taken by that vehicle.We provide a newalgorithm and tool, which is based on a hidden Markov model and takes characteristics of transit networks into account. Our tool can be useful for the visualization of transit lines in map services, for transit data providers, and for an on-line matching of live passenger GPS data to a public transit vehicle. We evaluate our tool on real-world data, and compare it against two baselines. The shapes produced by our tool are very close to the true shapes. We have made our software publicly available,enabling full reproducibility of our results.
Bast H, Schnelle N
Efficient and Convenient SPARQL+Text Search: A Quick Survey
2018 Reasoning Web International Summer School 2018 Springer, Cham, pages: 26 - 34

Abstract This is a quick survey about efficient search on a text corpus combined with a knowledge base. We provide a high-level description of two systems for searching such data efficiently. The first and older system, Broccoli, provides a very convenient UI that can be used without expert knowledge of the underlying data. The price is a limited query language. The second and newer system, QLever, provides an efficient query engine for SPARQL+Text, an extension of SPARQL to text search. As an outlook, we discuss the question of how to provide a system with the power of QLever and the convenience of Broccoli. Both Broccoli and QLever are also useful when only searching a knowledge base (without additional text).
Bast H, Azar Y, Herman G
26st European Symposium on Algorithms - LIPIcs
, volume: 112, 2018
Bast H, Buchhold B, Haussmann E
A Quality Evaluation of Combined Search on a Knowledge Base and Text
2018 Springer, pages: 19 - 26

Abstract We provide a quality evaluation of KB+Text search, a deep integration of knowledge base search and standard full-text search. A knowledge base (KB) is a set of subject–predicate–object triples with a common naming scheme. The standard query language is SPARQL, where queries are essentially lists of triples with variables. KB+Text search extends this by a special occurs-with predicate, which can be used to express the co-occurrence of words in the text with mentions of entities from the knowledge base. Both pure KB search and standard full-text search are included as special cases. We evaluate the result quality of KB+Text search on three different query sets. The corpus is the full version of the English Wikipedia (2.4 billion word occurrences) combined with the YAGO knowledge base (26 million triples). We provide a web application to reproduce our evaluation, which is accessible via http://ad.informatik. uni-freiburg.de/publications.

2017

back to the year overview

Bast H, Buchhold B
QLever: A Query Engine for Efficient SPARQL+Text Search
2017 CIKM

Abstract We present QLever, a query engine for efficient combined search on a knowledge base and a text corpus, in which named entities from the knowledge base have been identied (that is, recognized and disambiguated). The query language is SPARQL extended by two QLever-specific predicates ql:contains-entity and ql:contains-word, which can express the occurrence of an entity or word (the object of the predicate) in a text record (the subject of the predicate). We evaluate QLever on two large datasets, including FACC (the ClueWeb12 corpus linked to Freebase). We compare against three state-of-the-art query engines for knowledge bases with varying support for text search: RDF-3X, Virtuoso, Broccoli. Query times are competitive and often faster on the pure SPARQL queries, and several orders of magnitude faster on the SPARQL+Text queries. Index size is larger for pure SPARQL queries, but smaller for SPARQL+Text queries.
Daniel Bahrdt, Rick Gelhausen, Stefan Funke, Sabine Storandt
Searching OSM Planet with Context-Aware Spatial Relations
2017 ACM SIGSPATIAL Sigspatial, volume: 1

Abstract We consider the problem of indexing the complete OpenStreetMap planet data set (> 500GB of raw data) to support complex queries involving both text search as well as context-aware spatial relations. This requires (a) formalization of spatial relations like ’north of’, ’between’, ’near’ depending on the context, and (b) the development of suitable data representations to integrate textual and spatial information for efficient query performance.
Bast H, Brosi P, Storandt S
Efficient Generation of Geographically Accurate Transit Maps
2017

Abstract We present LOOM (Line-Ordering Optimized Maps), a fully automatic generator of geographically accurate transit maps. The input to LOOM is data about the lines of a given transit network, namely for each line, the sequence of stations it serves and the geographical course the vehicles of this line take. We parse this data from GTFS, the prevailing standard for public transit data. LOOM proceeds in three stages: (1) construct a so-called line graph, where edges correspond to segments of the network with the same set of lines following the same course; (2) construct an ILP that yields a line ordering for each edge which minimizes the total number of line crossings and line separations; (3) based on the line graph and the ILP solution, draw the map. As a naive ILP formulation is too demanding, we derive a new custom-tailored formulation which requires significantly fewer constraints. Furthermore, we present engineering techniques which use structural properties of the line graph to further reduce the ILP size. For the subway network of New York, we can reduce the number of constraints from 229,000 in the naive ILP formulation to about 4,500 with our techniques, enabling solution times of less than a second. Since our maps respect the geography of the transit network, they can be used for tiles and overlays in typical map services. Previous research work either did not take the geographical course of the lines into account, or was concerned with schematic maps without optimizing line crossings or line separations.
Bast H, Korzen C
A Benchmark and Evaluation for Text Extraction from PDF
2017

Abstract Extracting the body text from a PDF document is an important but surprisingly difficult task. The reason is that PDF is a layout-based format which specifies the fonts and positions of the individual characters rather than the semantic units of the text (e.g., words or paragraphs) and their role in the document (e.g., body text or caption). There is an abundance of extraction tools, but their quality and the range of their functionality are hard to determine. In this paper, we show how to construct a high-quality benchmark of principally arbitrary size from parallel TeX and PDF data. We construct such a benchmark of 12,098 scientific articles from arXiv.org and make it publicly available. We establish a set of criteria for a clean and independent assessment of the semantic abilities of a given extraction tool. We provide an extensive evaluation of 14 state-of-the-art tools for text extraction from PDF on our benchmark according to our criteria. We include our own method, Icecite, which significantly outperforms all other tools, but is still not perfect. We outline the remaining steps necessary to finally make text extraction from PDF a "solved problem".
Bast H, Buchhold B, Haussmann E, Heindorf S, Potthast M
WDSM Cup 2017: Vandalism Detection and Triple Scoring
2017

Abstract The WSDM Cup 2017 was a data mining challenge held in conjunction with the 10th International Conference on Web Search and Data Mining (WSDM). It addressed key challenges of knowledge bases today: quality assurance and entity search. For quality assurance, we tackle the task of vandalism detection, based on a dataset of more than 82 million user-contributed revisions of the Wikidata knowledge base, all of which annotated with regard to whether or not they are vandalism. For entity search, we tackle the task of triple scoring, using a dataset that comprises relevance scores for triples from type-like relations including occupation and country of citizenship, based on about 10,000 human relevance judgments. For reproducibility sake, participants were asked to submit their soft-ware on TIRA, a cloud-based evaluation platform, and they were incentivized to share their approaches open source.
Bast H, Buchhold B, Haussmann E
A Quality Evaluation of Combined Search on a Knowledge Base and Text
2017 Künstliche Intelligenz

Abstract We provide a quality evaluation of KB+Text search, a deep integration of knowledge base search and standard full-text search. A knowledge base (KB) is a set of subject-predicate-object triples with a common naming scheme. The standard query language is SPARQL, where queries are essentially lists of triples with variables. KB+Text search extends this by a special "occurs-with" predicate, which can be used to express the co-occurrence of words in the text with mentions of entities from the knowledge base. Both pure KB search and standard full-text search are included as special cases. We evaluate the result quality of KB+Text search on three different query sets. The corpus is the full version of the English Wikipedia (2.4 billion word occurrences) combined with the YAGO knowledge base (26 million triples). We provide a web application to reproduce our evaluation, which is accessible via http://ad.informatik.uni-freiburg.de/publications.
Bast H, Buchhold B, Haussmann E
Overview of the Triple Scoring Task at the WSDM Cup 2017
2017

Abstract This paper provides an overview of the triple scoring task at the WSDM Cup 2017, including a description of the task and the dataset, an overview of the participating teams and their results, and a brief account of the methods employed. In a nutshell, the task was to compute relevance scores for knowledge-base triples from relations, where such scores make sense. Due to the way the ground truth was constructed, scores were required to be integers from the range 0..7. For example, reasonable scores for the triples Tim Burton profession Director and Tim Burton profession Actor would be 7 and 2, respectively, because Tim Burton is well-known as a director, but he acted only in a few lesser known movies. The triple scoring task attracted considerable interest, with 52 initial registrations and 21 teams who submitted a valid run before the deadline. The winning team achieved an accuracy of 87%, that is, for that fraction of the triples from the test set (which was revealed only after the deadline) the difference to the score from the ground truth was at most 2. The best result for the average difference from the test set scores was 1.50.
Haussmann E
Towards Precise and Convenient Semantic Search on Text and Knowledge Bases
2017

Abstract In this dissertation, we consider the problem of making semantic search on text and knowledge bases more precise and convenient. In a nutshell, semantic search is search with meaning . To this respect, text and knowledge bases have different advantages and disadvantages. Large amounts of text are easily available on the web, and they contain a wealth of information in natural language. However, text represents information in an unstructured form. It follows no pre-defined schema, and without further processing, a machine can understand its meaning only on a superficial level. Knowledge bases, on the other hand, contain structured information in the form of subject predicate object triples. The meaning of triples is well defined, and triples can be retrieved precisely via a query language. However, ormulating queries in this language is inconvenient and compared to text only a small fraction of information is currently available in knowledge bases. In this document, we summarize our contributions on making semantic search on text and knowledge bases more precise and convenient. For knowledge bases, we introduce an approach to answer natural language questions. A user can pose questions conveniently in natural language and ask, for example, who is the ceo of apple? , instead of having to learn and use a specific query language. Our approach applies learning-to-rank strategies and improved the state of the art on two widely used benchmarks at the time of publication. For knowledge bases, we also describe a novel approach to compute relevance scores for triples from type-like relations like profession and nationality . For example, on a large knowledge base, a query for american actors can return a list of more than 60 thousand actors in no particular order. Relevance scores allow to sort this list so that, e.g., frequent lead actors appear before those who only had single cameo roles. In a benchmark that we generated via crowdsourcing, we show that our rankings are closer to human judgments than approaches from the literature. Finally, for text, we introduce a novel natural language processing technique that identifies which words in a sentence “semantically belong together”. For example, in the sentence Bill Gates, founder of Microsoft, and Jeff Bezos, founder of Amazon, are among the wealthiest persons in the world , the words Bill Gates ,founder , and Amazon do not belong together, but the words Bill Gates ,founder, and Microsoft do. We show that when query keywords are required to belong together in order to match, search results become more precise. Given the characteristics of text and knowledge bases outlined above, it is promising to consider a search that combines both. For example, for the query CEOs of U.S. companies who advocate cryptocurrencies , a list of CEOs of U.S. companies can be retrieved from a knowledge base. The information who is advocating cryptocurrencies is rather specific and changes frequently. It is, therefore, better found in full text. As part of this thesis, we describe how a combined search could be achieved and present and evaluate a fully func- tional prototype. All of our approaches are accompanied by an extensive evaluation which show their practicability and, where available, compare them to established approaches from the literature.

2016

back to the year overview

Bast H, Buchhold B, Haussmann E
Semantic Search on Text and Knowledge Bases
2016

Abstract This article provides a comprehensive overview of the broad area of semantic search on text and knowledge bases. In a nutshell, semantic search is “search with meaning”. This “meaning” can refer to various parts of the search process: understanding the query (instead of just finding matches of its components in the data), understanding the data (instead of just searching it for such matches), or representing knowledge in a way suitable for meaningful retrieval. Semantic search is studied in a variety of different communities with a variety of different views of the problem. In this survey, we classify this work according to two dimensions: the type of data (text, knowl- edge bases, combinations of these) and the kind of search (keyword, structured, natural language). We consider all nine combinations. The focus is on fundamental techniques, concrete systems, and benchmarks. The survey also considers advanced issues: ranking, indexing, ontology matching and merging, and inference. It also provides a succinct overview of natural language processing techniques that are useful for semantic search: POS tagging, named-entity recognition and disambiguation, sentence parsing, and word vectors. The survey is as self-contained as possible, and should thus also serve as a good tutorial for newcomers to this fascinating and highly topical field.
Bast H, Storandt S, Hertel M
Scalable Transfer Patterns
2016 ALENEX

Abstract We consider the problem of Pareto-optimal route planning in public-transit networks of a whole country, a whole continent, or even the whole world. On such large networks, existing approaches suffer from either a very large space consumption, a very long preprocessing time or slow query processing. Transfer Patterns, a state-of-the-art technique for route planning in transit networks, achieves excellent query times, but the space consumption is large and the preprocessing time is huge. In this paper, we introduce a new scheme for the Transfer Pattern precomputation and query graph construction that reduces both the necessary preprocessing time and space consumption by an order of magnitude and more. Average query times are below 1 ms for local queries, independent of the size of the network, around 30 ms for non-local queries on the complete transit network of Germany, and an estimated 200 ms for a ﬁctitious transit network covering the currently available data of the whole world.
Bast H, Delling D, Goldberg A, Müller-Hannemann M, Pajor T, Sanders P, Wagner D, Werneck R
Route Planning in Transportation Networks
In: Algorithm Engineering
2016, Springer, Lasse Kliemann, Peter Sanders, pages: 19 - 80, Lasse Kliemann, Peter Sanders, ISBN: 978-3-319-49486-9

Abstract We survey recent advances in algorithms for route planning in transportation networks. For road networks, we show that one can compute driving directions in milliseconds or less even at continental scale. A variety of techniques provide different trade-offs between preprocessing effort, space requirements, and query time. Some algorithms can answer queries in a fraction of a microsecond, while others can deal efficiently with real-time traffic. Journey planning on public transportation systems, although conceptually similar, is a significantly harder problem due to its inherent time-dependent and multicriteria nature. Although exact algorithms are fast enough for interactive queries on metropolitan transit systems, dealing with continent-sized instances requires simplifications or heavy preprocessing. The multimodal route planning problem, which seeks journeys combining schedule-based transportation (buses, trains) with unrestricted modes (walking, driving), is even harder, relying on approximate solutions even for metropolitan inputs.

2015

back to the year overview

Funke S, Storandt S
Provable Efficiency of Contraction Hierarchies with Randomized Preprocessing
2015 ISAAC

Abstract Abstract. We present a new way of analyzing Contraction Hierarchies (CH), a widely used speed-up technique for shortest path computations in road networks. In previous work, preprocessing and query times of deterministically constructed CH on road networks with n nodes were shown to be polynomial in n as well as the highway dimension h of the network and its diameter D. While h is conjectured to be polylogarithmic for road networks, a tight bound remains an open problem. We rely on the empirically justiﬁable assumption of the road network exhibiting small growth. We introduce a method to construct randomized Contraction Hierarchies on road networks as well as a probabilistic query routine. Our analysis reveals that randomized CH lead to sublinear search space sizes in the order of √n log √n, auxiliary data in the order of n log²√n,and correct query results with high probability after a polynomial time preprocessing phase.
Storandt S
Approximation Algorithms in the Successive Hitting Set Model
2015 ISAAC

Abstract Abstract. We introduce the successive Hitting Set model, where the set system is not given in advance but a set generator produces the sets that contain a speciﬁc element from the universe on demand. Despite incomplete knowledge about the set system, we show that several approximation algorithms for the conventional Hitting Set problem can be adopted to perform well in this model. We describe, and experimentally investigate, several scenarios where the new model is beneﬁcial compared to the conventional one.
Bast H, Storandt S, Weidner S
Fine-Grained Population Estimation
2015 GIS

Abstract We show how to estimate population numbers for arbitrary user-deﬁned regions, down to the level of individual buildings. This is important for various applications like evacuation planning, facility placement, or traffic estimation. However, census data with precise population numbers is typically only available at the level of cities, villages, or districts, if at all. Previous approaches either rely on available census data for already small areas or on sophisticated input data like high resolution aerial images. Our framework uses only freely available data, in particular, OpenStreetMap data. In the OpenStreetMap project, crowd-sourced data is collected about street networks, buildings, places of interest as well as all kind of regions and natural structures world-wide. We use this data to learn three classiﬁers that are relevant for the population distribution inside an area: residential vs. in-dustrial vs. commercial landuse, inhabited vs. uninhabited buildings, and single-family vs. multi-family houses. Once learned, we can use these classiﬁers for population estimation even in areas without any census data at all. Our experiments show good average accuracy (measured as the deviation from actual census data) for rural areas (25%), metropolitan areas (10%), and cities in countries other than that containing the training data (12%).
Storandt S, Funke S
Personalized Route Planning in Road Networks
2015 GIS

Abstract Computing shortest paths in road networks with millions of nodes and edges is challenging on its own. In the last few years, several preprocessing-based acceleration techniques have been developed to enable query answering orders of magnitudes faster than a plain Dijkstra computation. But most of these techniques work only if the metric which de-termines the optimal path is static or rarely changes. In con-trast to that, we aim at answering personalized route plan-ning queries. Here, every single query comes with a speciﬁ-cation of its very own metric. This increases the combina-torial complexity of the problem signiﬁcantly. We develop new preprocessing schemes that allow for real-time personal-ized route planning in huge road networks while keeping the memory footprint of the preprocessed data and subsequent queries small.
Bast H, Haussmann E
More Accurate Question Answering on Freebase
Materials
2015 CIKM

Abstract Real-world factoid or list questions often have a simple structure, yet are hard to match to facts in a given knowledge base due to high representational and linguistic variability. For example, to answer "who is the ceo of apple" on Freebase requires a match to an abstract "leadership" entity with three relations "role", "organization" and "person", and two other entities "apple inc" and "managing director". Recent years have seen a surge of research activity on learning-based solutions for this method. We further advance the state of the art by adopting learning-to-rank methodology and by fully addressing the inherent entity recognition problem, which was neglected in recent works. We evaluate our system, called Aqqu, on two standard benchmarks, Free917 andWebQuestions, improving the previous best result for each benchmark considerably. These two benchmarks exhibit quite different challenges, and many of the existing approaches were evaluated (and work well) only for one of them. We also consider efficiency aspects and take care that all questions can be answered interactively (that is, within a second). Materials for full reproducibility are available on our website: http://ad.informatik. uni-freiburg.de/publications .
Storandt S, Funke S, Schirrmeister R
Automatic Extrapolation of Missing Road Network Data in OpenStreetMap
2015 MUD@ICML

Abstract Road network data from OpenStreetMap (OSM) is the basis of various real-world applications such as fleet management or traffic flow estima- tion, and has become a standard dataset for re- search on route planning and related subjects. The quality of such applications and conclusive- ness of research crucially relies on correctness and completeness of the underlying road network data. We introduce methods for automatic de- tection of gaps in the road network and extrapo- lation of missing street names by learning topo- logical and semantic characteristics of road net- works. Our experiments show that with the help of the learned data, the quality of the OSM road network data can indeed be improved.
Bast H, Buchhold B, Haussmann E
Relevance Scores for Triples from Type-Like Relations
Materials
2015 SIGIR .

Abstract We compute and evaluate relevance scores for knowledge-base triples from type-like relations. Such a score measures the degree to which an entity “belongs” to a type. For example, Quentin Tarantino has various professions, including Film Director, Screenwriter, and Actor. The first two would get a high score in our setting, because those are his main professions. The third would get a low score, because he mostly had cameo appearances in his own movies. Such scores are essential in the ranking for entity queries, e.g. “american actors” or “quentin tarantino professions”. These scores are different from scores for "correctness" or "accuracy" (all three professions above are correct and accurate). We propose a variety of algorithms to compute these scores. For our evaluation we designed a new benchmark, which includes a ground truth based on about 14K human judgments obtained via crowdsourcing. Inter-judge agreement is slightly over 90%. Existing approaches from the literature give results far from the optimum. Our best algorithms achieve an agreement of about 80% with the ground truth.
Funke S, Schirrmeister R, Skilevic S, Storandt S
Compass-Based Navigation in Street Networks
2015 W2GIS

Abstract We present a new method for navigating in a street network using solely data acquired by a (smartphone integrated electronic) compass for self-localization. To make compass-based navigation in street networks practical, it is crucial to deal with all kinds of imprecision and different driving behaviors. We therefore develop a trajectory representation based on so-called in ection points which turns out to be very robust against measurement variability. To enable real-time localization with compass data, we construct a custom-tailored data structure inspired by algorithms for effcient pattern search in large texts. Our experiments reveal that on average already very short sequences of inflection points are unique in a large street network, proving that this representation allows for accurate localization.
Funke S, Nusser A, Storandt S
On k-Path Covers and their applications
2015 Vldb J

Abstract Abstract For a directed graph G with vertex set V ,wecalla subset C ⊆ V a k-(All-)Path Cover if C contains a node from any simple path in G consisting of k nodes. This paper con-siders the problem of constructing small k-Path Covers in the context of road networks with millions of nodes and edges. In many application scenarios, the set C and its induced overlay graph constitute a very compact synopsis of G, which is the basis for the currently fastest data structure for personalized shortest path queries, visually pleasing overlays of subsam-pled paths, and efﬁcient reporting, retrieval and aggregation of associated data in spatial network databases. Apart from a theoretic investigation of the problem, we provide efﬁcient algorithms that produce very small k-Path Covers for large real-world road networks (with a posteriori guarantees via instance-based lower bounds). We also apply our algorithms to other (social, collaboration, web, etc.) networks and can improve in several instances upon previous approaches.
Storandt S, Funke S, Nusser A
Placement of Loading Stations for Electric Vehicles: No Detours Necessary!
2015 Journal of Artificial Intelligence, issue: 08

Abstract Compared to conventional cars, electric vehicles (EVs) still suffer from considerably shorter cruising ranges. Combined with the sparsity of battery loading stations, the com-plete transition to E-mobility still seems a long way to go. In this paper, we consider the problem of placing as few loading stations as possible so that on any shortest path there are suﬃciently many not to run out of energy. We show how to model this problem and introduce heuristics which provide close-to-optimal solutions even in large road networks.
Storandt S
Route Planning for Electric Vehicles: Taking Energy Efficiency, Distance, and Reloading Opportunities into Account
2015 Lecture Notes in Informatics

2014

back to the year overview

Haussmann E
An Ensemble for Query Intent Detection
2014 CKIM CUP
Bast H, Bäurle F, Buchhold B, Haussmann E
Easy Access to the Freebase Dataset
2014 WWW, pages: 95 - 98

Abstract We demonstrate a system for fast and intuitive exploration of the Freebase dataset. This required solving several non- trivial problems, including: entity scores for proper anking and name disambiguation, a unique meaningful name for every entity and every type, extraction of canonical binary relations from multi-way relations (which in Freebase are modeled via so-called mediator objects), computing the transitive hull of selected relations, and identifying and merging duplicates. Our contribution is two-fold. First, we provide for download an up-to-date version of the Freebase data, enriched and simplified as just sketched. Second, we offer a user interface for exploring and searching this data set. The data set, the user interface and a demo video are available from http://freebase-easy.cs.uni-freiburg.de.
Bast H, Celikik M
Efficient Index-Based Snippet Generation
2014 Acm T Inform Syst, volume: 32, issue: 2, supplement: 6, pages: 1 - 24

Abstract Ranked result lists with query-dependent snippets have become state of the art in text search. They are typically implemented by searching, at query time, for occurrences of the query words in the top-ranked documents. This document-based approach has three inherent problems: (i) when a document is indexed by terms which it does not contain literally (e.g., related words or spelling variants), localization of the corresponding snippets becomes problematic; (ii) each query operator (e.g., phrase or proximity search) has to be implemented twice, on the index side in order to compute the correct result set, and on the snippet generation side to generate the appropriate snippets; and (iii) in a worst case, the whole document needs to be scanned for occurrences of the query words, which could be problematic for very long documents. We present a new index-based method that localizes snippets by information solely computed from the index and that overcomes all three problems. Unlike previous index-based methods, we show how to achieve this at essentially no extra cost in query processing time, by a technique we call operator inversion. We also show how our index-based method allows the caching of individual segments instead of complete documents, which enables a significantly larger cache hit-ratio as compared to the document-based approach. We have fully integrated our implementation with the CompleteSearch engine.
Bast H, Storandt S
Flow-Based Guidebook Routing
2014 ALENEX

Abstract Public-transportation route-planning systems typically work as follows. The user specifies a source and a target location, as well as a departure time. The system then returns one or more optimal trips at or after that departure time. In this paper, we consider guidebook routing, where the goal is to provide time-independent answers that are valid over long periods of time. An example answer could be: Take Bus 10 to the main station, from there take Tram 11 or 13 (whichever comes next) to your target station. Trip duration: 30 minutes. Frequency: every 20 minutes. Valid: weekdays from 6am - 8pm. We show how to compute such guidebook routes efficiently and with provably good quality. An evaluation on real-world data shows that few guidebook routes usually suffice for good coverage. We also show how guidebook routing can be used to speed up transfer patterns, a state-of-the-art method for public transportation routing.
Bast H, Sternisko J, Storandt S
ForestMaps: A Computational Model and Visualization for Forest Utilization
2014 W2GIS

Abstract We seek to compute utilization information for public spaces, in particular forests: which parts are used by how many people. Our contribution is threefold. First, we present a sound model for computing this information from publicly available data such as road maps and population counts. Second, we present efficient algorithms for computing the desired utilization information according to this model. Third, we provide an experimental evaluation with respect to both efficiency and quality, as well as an interactive web application, that visualizes our result as a heat-map layer on top of OpenStreetMap data. The link to our web application can be found under http://forestmaps.informatik.uni-freiburg.de .
Bast H, Brosi P, Storandt S
Frequency-Based Search for Public Transit
2014 SIGSPATIAL

Abstract We consider the application of route planning in large publictransportation networks (buses, trains, subways, etc). Many connections in such networks are operated at periodic time intervals. When a set of connections has sufficient periodicity,it becomes more efficient to store the time range and frequency (e.g., every 15 minutes from 8:00am - 6:00pm) instead of storing each of the time events separately. Identifying an optimal frequency-compression is NP-hard, so we present a time- and space-efficient heuristic. We show how we can use this compression to not only save file queries, which ask for all optimal routes with departure times in a given interval (e.g., a whole day). In particular, we design a new version of Dijkstra's algorithm that works with frequency-based labels and is suitable for profile queries. We evaluate the savings of our approach on two metropolitan and three country-wide public-transportation networks. On our largest network, we simultaneously achieve a better space consumption than all previous methods as well as profile query times that are about 5 times faster than the best previous method. We also improve Transfer Patterns, a state-of-the-art technique for fully realistic route planning in large public-transportation networks. In particular, we accelerate the expensive preprocessing by a factor of 60 compared to the original publication.
Bast H, Haussmann E
More Informative Open Information Extraction via Simple Inference
Demo
2014 ECIR

Abstract Recent Open Information Extraction (OpenIE) systems utilize grammatical structure to extract facts with very high recall and good precision. In this paper, we point out that a significant fraction of the extracted facts is, however, not informative. For example, for the sentence 'The ICRW is a non-profit organization headquartered in Washington', the extracted fact (a non-profit organization) (is headquartered in) (Washington) is not informative. This is a problem for semantic search applications utilizing these triples, which is hard to fix once the triple extraction is completed. We therefore propose to integrate a set of simple inference rules into the extraction process. Our evaluation shows that, even with these simple rules, the percentage of informative triples can be improved considerably and the already high recall can be improved even further. Both improvements directly increase the quality of search on these triples.
Bast H, Brosi P, Storandt S
Real-Time Movement Visualization of Public Transit Data
2014 SIGSPATIAL

Abstract We introduce a framework to create a world-wide live map of public transit, i.e. the real-time movement of all buses, sub-ways, trains and ferries. Our system is based on freely available General Transit Feed Specification (GTFS) timetable data and also features real-time delay information (where available). The main problem of such a live tracker is the enormous amount of data that has to be handled (millions of vehicle movements). We present a highly efficient back-end that accepts temporal and spatial boundaries and returns all relevant trajectories and vehicles in a format that allows for easy rendering by the client. The real-time movement visualization of complete transit networks allows to observe the current state of the system, to estimate the transit coverage of certain areas, to display delays in a neat manner, and to inform a mobile user about near-by vehicles. Our system can be accessed via http://tracker.geops.ch/. The current implementation features over 80 transit networks, including the complete Netherlands (with real-time delay data), and various metropolitan areas in the US, Europe, Australia and New Zealand. We continuously integrate new data. Especially for Europe and North America we expect to achieve almost full coverage soon.
Bast H, Bäurle F, Buchhold B, Haussmann E
Semantic full-text search with broccoli
2014 SIGIR, pages: 1265 - 1266

Abstract We combine search in triple stores with full-text search into what we call semantic full-text search. We provide a fully functional web application that allows the incremental construction of complex queries on the English Wikipedia combined with the facts from Freebase. The user is guided by context-sensitive suggestions of matching words, instances, classes, and relations after each keystroke. We also provide a powerful API, which may be used for research tasks or as a back end, e.g., for a question answering system. Our web application and public API are available under http://broccoli.cs.uni-freiburg.de.
Bast H, Brosi P, Storandt S
TRAVIC: A Visualization Client for Public Transit Data
2014 SIGSPATIAL

Abstract We present TRAVIC, a thin browser-based client that is able to display smooth vehicle movements on a map. The focus is on visualizing world-wide public transit vehicle movements in an interactive way. But we also investigate other use cases, for example, traffc simulation. We describe in detail which server requests are fired and how the received data is handled. We also provide a performance evaluation conducted on several browsers. We show that, in combination with an efficient back-end, TRAVIC is able to display many thousands of vehicle movements in realtime. Our prototype implementation can be accessed under http://tracker.geops.ch .

2013

back to the year overview

Bast H
Algorithmic Problems in Semantic Search
2013 ESA
Bast H, Buchhold B
An Index for Efficient Semantic Full-Text Search
2013 CIKM
Storandt S
Contraction Hierarchies on Grid Graphs
2013 KI

Abstract Many speed-up techniques developed for accelerating the computation of shortest paths in road networks, like reach or contraction hierarchies, are based on the property that some streets are ’more important’ than others, e.g. on long routes the usage of an interstate is almost inevitable. In grids there is no obvious hierarchy among the edges, especially if the costs are uniform. Nevertheless we will show that contraction hierarchies can be applied to grid graphs as well. We will point out interesting connections to speed-up techniques shaped for routing on grids, like swamp hierarchies and jump points, and provide experimental results for game maps, mazes, random grids and rooms.
Schnelle N, Funke S, Storandt S
DORC: Distributed Online Route Computation – Higher Throughput, more Privacy
2013 PerCOM

Abstract We propose a technique to distribute the work- load of online route planners as offered for example by Bing/Google/Yahoo Maps, etc. among the clients requesting the routes. Our scheme not only increases the throughput of a server answering the requests of clients but also yields a simple way of providing some degree of privacy for the user. A prototype implementation of our system is available as an Android app in Google Play and on Github.
Bast H, Sternisko J, Storandt S
Delay-Robustness of Transfer Patterns in Public Transportation Route Planning
Materials
2013 ATMOS

Abstract Transfer pattern routing is a state-of-the-art speed-up technique for finding optimal paths which minimize multiple cost criteria in public transportation networks. It precomputes sequences of transfer stations along optimal paths. At query time, the optimal paths are searched among the stored transfer patterns, which allows for very fast response times even on very large networks. On the other hand, even a minor change to the timetables may affect many optimal paths, so that, in principle, a new computation of all optimal transfer patterns becomes necessary. In this paper, we examine the robustness of transfer pattern routing towards delay, which is the most common source of such updates. The intuition is that the deviating paths caused by typical updates are already covered by original transfer patterns. We perform experiments which show that the transfer patterns are remarkably robust even to large and many delays, which underlines the applicability and reliability of transfer pattern routing in realistic routing applications.
Bast H, Celikik M
Efficient fuzzy search in large text collections
2013 Acm T Inform Syst, volume: 31, issue: 2

Abstract We consider the problem of fuzzy full-text search in large text collections, that is, full-text search which is robust against errors both on the side of the query as well as on the side of the documents. Standard inverted-index techniques work extremely well for ordinary full-text search but fail to achieve interactive query times (below 100 milliseconds) for fuzzy full-text search even on moderately-sized text collections (above 10 GBs of text). We present new preprocessing techniques that achieve interactive query times on large text collections (100 GB of text, served by a single machine). We consider two similarity measures, one where the query terms match similar terms in the collection (e.g., algorithm matches algoritm or vice versa) and one where the query terms match terms with a similar prefix in the collection (e.g., alori matches algorithm). The latter is important when we want to display results instantly after each keystroke (search as you type). All algorithms have been fully integrated into the CompleteSearch engine.
Storandt S, Funke S
Enabling Emobility: Facility Location for Battery Loading Stations
2013 AAAI

Abstract The short cruising range due to the limited battery supply of current Electric Vehicles (EVs) is one of the main obstacles for a complete transition to E-mobility. Until batteries of higher energy storage density have been developed, it is of utmost importance to deliberately plan the locations of new loading stations for best possible coverage. Ideally the net- work of loading stations should allow driving from anywhere to anywhere (and back) without running out of energy. We show that minimizing the number of necessary loading sta- tions to achieve this goal is NP-hard and even worse, we can rule out polynomial-time constant approximation algorithms. Hence algorithms with better approximation guarantees have to make use of the special structure of road networks (which is not obvious how to do it). On the positive side, we show with instance based lower bounds that our heuristic algorithms achieve provably good solutions on real-world problem instances.
Bast H, Haussmann E
Open Information Extraction via Contextual Sentence Decomposition
2013 ICSC
Funke S, Storandt S
Polynomial-time Construction of Contraction Hierarchies for Multi-criteria Objectives
2013 ALENEX

Abstract We consider multicriteria shortest path problems and show that contraction hierarchies – a very powerful speed-up technique originally developed for standard shortest path queries – can be constructed efficiently for the case of arbitrary conic combinations of the edge costs. This extends previous results which considered only the bicriteria case and discrete weights for the objective functions. On the theory side we prove a polynomial time bound for determining whether a path π is part of the lower envelope of all pareto-optimal paths via some polyhedral arguments. Experiments complement these results by showing the practicability of our approach.
Bast H, Brodesser M, Storandt S
Result Diversity for Multi-Modal Route Planning
2013 ATMOS

Abstract We study multi-modal route planning allowing arbitrary (meaningful) combinations of public transportation, walking, and taking a car / taxi. In the straightforward model, the number of Pareto-optimal solutions explodes. It turns out that many of them are similar to each other or unreasonable. We introduce a new filtering procedure, Types aNd Thresholds (TNT), which leads to a small yet representative subset of the reasonable paths. We consider metropolitan areas like New York, where a fast computation of the paths is difficult. To reduce the high compu- tation times, optimality-preserving and heuristic approaches are introduced. We experimentally evaluate our approach with respect to result quality and query time. The experiments confirm that our result sets are indeed small (around 5 results per query) and representative (among the reasonable Pareto-optimal paths), and with average query times of about one second or less.
Bast H
Semantische Suche
2013 Informatik Spektrum (INSK), volume: 36, issue: 2, pages: 136 - 143
Bast H, Korzen C
The Icecite Research Paper Management System
Presentation
2013 WISE, volume: 8181, pages: 396 - 409

Abstract We present Icecite, a new fully web-based research paper management system (RPMS). Icecite facilitates the following otherwise laborious and time-consuming steps typically involved in literature research: automatic metadata and reference extraction, on-click reference downloading, shared annotations, offline availability, and full-featured search in metadata, full texts, and annotations. None of the many existing RPMSs provides this feature set. For the metadata and reference extraction, we use a rule-based approach combined with an index-based approximate search on a given reference database. An extensive quality evaluation, using DBLP and PubMed as reference databases, shows extraction accuracies of above 95%. We also provide a small user study, comparing Icecite to the state-of-the-art RPMS Mendeley as well as to an RPMS-free baseline.

2012

back to the year overview

Bast H, Bäurle F, Buchhold B, Haussmann E
A Case for Semantic Full-Text Search
2012 SIGIR-JIWES
Bast Hannah, Bäurle Florian, Buchhold Björn, Haussmann Elmar
Broccoli: Semantic Full-Text Search at your Fingertips
2012 CoRR, volume: abs/1207.2615
Celikik M
Efficient Error-Tolerant Search on Large Text Collections
2012

Abstract In this dissertation, we consider the problem of fuzzy full-text search and query suggestion in large text collections, that is, full-text search that is robust against errors on the side of the query as well as errors on the side of the documents. We consider two variants of the problem. The first variant is keyword-based search tolerant to errors. The second variant is autocompletion or prefix search tolerant to errors. In this variant of the problem, each keyword can be specified partially and the results appear instantly as the user types the query letter by letter. One of the main challenges in building an information retrieval system for fuzzy search is efficiency. Providing interactive query times (below 100 ms) for either fuzzy search variant is surprisingly challenging due to the one order of magnitude larger volume of data to be handled by the system. While efficient index data structures exist that allow fast search for the exact variants of the problem, there has been limited work on indexes that tackle fuzzy search. Commercial search engines such as Yahoo!, Google and Bing provide error-tolerance to certain extent thanks to the large amount of available query log data. In our version of the problem, we assume a complete absence of query logs or any other precomputed information. This assumption is often realistic for information retrieval systems for vertical or domain-specific search that typically have a much smaller user base. In the first part of this dissertation, we propose efficient data structures and algorithms that are the core of our fuzzy search. In the second part, we address important algorithm-engineering aspects of an error-tolerant search system. All of our algorithms and data structures have been implemented and integrated into the CompleteSearch engine.
Aslam Faisal, Baig Ghufran, Qureshi Mubashir, Uzmi Zartash, Fennell Luminous, Thiemann Peter, Schindelhauer Christian, Haussmann Elmar
Rethinking Java call stack design for tiny embedded devices
2012 LCTES, pages: 1 - 10

2011

back to the year overview

Bast Hannah, Celikik Marjan
Fast construction of the HYB index
2011 ACM Trans. Inf. Syst., volume: 29, issue: 3

2010

back to the year overview

Bast Hannah, Celikik Marjan
Efficient two-sided error-tolerant search
2010 KEYS
Bast Hannah, Carlsson Erik, Eigenwillig Arno, Geisberger Robert, Harrelson Chris, Raychev Veselin, Viger Fabien
Fast Routing in Very Large Public Transportation Networks Using Transfer Patterns
2010 ESA (1), pages: 290 - 301
Aslam Faisal, Fennell Luminous, Schindelhauer Christian, Thiemann Peter, Ernst Gidon, Haussmann Elmar, Rührup Stefan, Uzmi Zartash
Optimized Java Binary and Virtual Machine for Tiny Motes
2010 DCOSS, pages: 15 - 30
Veltkamp Remco, Giezeman Geert-Jan, Bast Hannah, Baumbach Thomas, Furuya Takahiko, Giesen Joachim, Godil Afzal, Lian Zhouhui, Ohbuchi Ryutarou, Saleem Waqar
SHREC’10 Track
2010 3DOR, pages: 63 - 69

2009

back to the year overview

Bast Hannah
Car or Public Transport - Two Worlds
2009 Efficient Algorithms, pages: 355 - 367
Celikik Marjan, Bast Hannah
Fast Single-Pass Construction of a Half-Inverted Index
2009 SPIRE, pages: 194 - 205
Bast H, Celikik M
Fast error-tolerant search on very large texts
2009 SAC, pages: 1724 - 1731

Sektionen

Publications

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Start Footer

Benutzerspezifische Werkzeuge