Pre-processing serves as an enhancement to the model's inputs in the NL2SQL parsing process. Although not strictly necessary, pre-processing significantly contributes to the refinement of NL2SQL parsing.
The purpose of schema linking is to identify the tables and columns relevant to a given NL query. It ensures the accurate mapping and processing of key information within a limited input, thereby improving the performance of the NL2SQL task. In the LLM era, schema linking has become increasingly crucial due to the input length limits of LLMs.
- Data-anonymous encoding for text-to-sql generation: This paper formulates schema linking as a sequential tagging problem and proposes a two-stage anonymization model to learn the semantic relationship between the schema and the NL query.
- Re-examining the role of schema linking in text-to-sql: This paper annotates schema linking information for each instance in the training and development sets of Spider to support a data-driven and systematic study.
- RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL: This paper proposes a ranking-enhanced encoding framework for schema linking. An additional cross-encoder is trained to classify tables and columns based on the input query; the framework then ranks and filters them according to the classification probabilities, producing a ranked sequence of schema items.
- C3: Zero-shot Text-to-SQL with ChatGPT: This paper designs different zero-shot prompts to instruct GPT-3.5 in table and column linking, employing the self-consistency method. For table linking, the prompt guides the process in three steps: ranking tables by relevance, ensuring all relevant tables are included, and outputting the result in list format. For column linking, another prompt guides the ranking of columns within candidate tables and output in dictionary format, prioritizing columns that match question terms or foreign keys.
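The ranking-and-filtering idea above can be sketched in a few lines. This is a minimal illustration, not any paper's implementation: the lexical-overlap scorer is a toy stand-in for RESDSQL's trained cross-encoder, and the names `overlap_score` and `link_schema` are hypothetical.

```python
# Minimal schema-linking sketch: score every schema item against the
# question and keep only the top-k, in the spirit of RESDSQL's ranking stage.
# The lexical-overlap scorer is a toy stand-in for a trained cross-encoder.
import re

def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(question: str, item: str) -> float:
    """Fraction of a schema item's tokens that also appear in the question."""
    item_tokens = _tokens(item)
    return len(_tokens(question) & item_tokens) / max(len(item_tokens), 1)

def link_schema(question: str, schema_items: list[str], k: int = 3) -> list[str]:
    """Return the k schema items ranked most relevant to the question."""
    return sorted(schema_items,
                  key=lambda it: overlap_score(question, it),
                  reverse=True)[:k]

schema = ["singer.name", "singer.country", "concert.stadium_id", "stadium.capacity"]
print(link_schema("What is the capacity of each stadium?", schema, k=2))
# → ['stadium.capacity', 'concert.stadium_id']
```

Filtering the schema this way keeps only the most relevant items inside the limited LLM input, which is exactly why schema linking matters under input length constraints.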
The purpose of database content retrieval is to efficiently retrieve cell values through textual search algorithms and database indexing. Given the large scale of databases, retrieving cell values from them is resource-intensive. Moreover, satisfying the requirements of WHERE and JOIN clauses can significantly improve NL2SQL performance. It is therefore crucial to adopt strategies appropriate to the scenario's requirements.
- Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing: BRIDGE designs anchor-text matching to automatically extract cell values mentioned in the NL query. It uses a heuristic that computes the maximum sequence match between the question and the cell values to determine the matching boundary; when a cell value is a substring of a word in the query, the heuristic excludes that string match. The matching threshold is then tuned via coarse accuracy measurements.
- ValueNet: A Natural Language-to-SQL System that Learns from Database Information: ValueNet implements three methods for generating candidate cell values, based on n-grams, string similarity, and heuristic selection.
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data: TaBERT uses database content snapshots to encode the subset of database content relevant to the NL query, and an attention mechanism to manage information among cell-value representations across different rows.
- Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation: IRNet employs the knowledge graph ConceptNet to recognize cell-value links and to search for cell-value candidates in the knowledge graph. When a result exactly or partially matches a cell value, the column is assigned the type "value exact match" or "value partial match", respectively.
- RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers: RAT-SQL improves structural reasoning capabilities by modeling the relationship between cell values and the NL query. Specifically, it identifies the column-value relationship, meaning that a value in the question is part of a candidate cell value of a column.
- CHESS: Contextual Harnessing for Efficient SQL Synthesis: CHESS uses a locality-sensitive hashing algorithm for approximate nearest-neighbor search. It indexes unique cell values to quickly identify the values most similar to the NL query, which significantly speeds up computing the edit distance and semantic embedding between the NL query and cell values.
- CodeS: Towards Building Open-source Language Models for Text-to-SQL: CodeS introduces a coarse-to-fine cell-value matching approach that leverages indexes for a coarse-grained initial search, followed by a fine-grained matching process. It first builds an index over all values using BM25; the index identifies candidate values relevant to the NL query, and the Longest Common Substring algorithm then computes the matching degree between the NL query and each candidate to find the most relevant cell values.
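The coarse-to-fine matching pipeline described for CodeS can be sketched as follows. This is an illustrative reconstruction, not the CodeS implementation: the cheap character-n-gram filter stands in for a real BM25 index, and `match_cell_values` is a hypothetical name; only the fine-grained Longest Common Substring ranking follows the description directly.

```python
# Coarse-to-fine cell-value matching sketch in the spirit of CodeS:
# a cheap coarse filter narrows candidates (a stand-in for a BM25 index),
# then Longest Common Substring (LCS) length ranks the survivors.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common substring of a and b (O(len(a)*len(b)) DP)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Lowercased character n-grams, used as a crude inverted-index key."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def match_cell_values(question: str, values: list[str], top_n: int = 2) -> list[str]:
    q = question.lower()
    q_grams = char_ngrams(q)
    # Coarse stage: keep values sharing at least one 3-gram with the question.
    candidates = [v for v in values if char_ngrams(v) & q_grams] or values
    # Fine stage: rank candidates by LCS length with the question.
    return sorted(candidates, key=lambda v: lcs_length(q, v.lower()), reverse=True)[:top_n]

cells = ["United States", "United Kingdom", "France", "Germany"]
print(match_cell_values("How many singers are from the United States?", cells))
# → ['United States', 'United Kingdom']
```

The coarse stage avoids running the quadratic LCS computation against every cell value in a large database, which mirrors why CodeS puts the BM25 index in front of the fine-grained matcher.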
Additional information (e.g., domain knowledge) plays an essential role in improving the ability of NL2SQL models to understand the NL query, perform schema linking, and carry out the NL2SQL translation. Such information can provide demonstration examples, domain knowledge, formulaic evidence, and format information to the NL2SQL backbone model or to specific modules, thereby enhancing the quality of the generated results.
- DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction: DIN-SQL injects additional information through few-shot learning across multiple stages of its workflow, such as schema linking, query classification, task decomposition, and self-correction. These stages allow DIN-SQL to effectively tackle various challenges, including complex schema links, identification of multi-table joins, and handling of nested queries.
- CodeS: Towards Building Open-source Language Models for Text-to-SQL: CodeS uses metadata from cross-domain databases as the main additional information, including data types and annotation text, which help the model resolve potential ambiguities and understand entity relationships. The extracted information is rendered as coherent text and concatenated with the question to form the final input context.
- PET-SQL: A Prompt-enhanced Two-stage Text-to-SQL Framework with Cross-consistency: PET-SQL constructs a pool of examples from the training set, containing question frames and question-SQL pairs. It then selects the $k$ examples most similar to the target question and combines them with customized prompts to form the final input.
- Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation: DAIL-SQL designs a two-stage representation algorithm for additional information. It first presents the question and database as SQL-statement hints, providing comprehensive database information; it then employs a masking mechanism and similarity calculation to select appropriate examples, and systematically organizes tags to improve the algorithm's efficiency.
- The Dawn of Natural Language to SQL: Are We Fully Ready?: SuperSQL extends DAIL-SQL's representation algorithm by integrating similarity-based example selection with schema linking and database content information, filtering out irrelevant schemas and thereby enhancing the quality of SQL generation.
- Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge: REGROUP constructs a formulaic knowledge base covering domains such as finance, real estate, and transportation. It leverages a Dense Passage Retriever (DPR) to compute similarity scores for results retrieved from the knowledge base. An Erasing-Then-Awakening (ETA) model then aligns entities in these formulaic knowledge items with entities in the NL query and schema, filtering out irrelevant entities below a confidence threshold and mapping the remainder to schema elements, thereby grounding the knowledge for accurate SQL query generation.
- Reboost Large Language Model-based Text-to-SQL, Text-to-Python, and Text-to-Function with Real Applications in Traffic Domain: ReBoost interacts with the LLM via an Explain-Squeeze schema-linking mechanism, a two-phase strategy. It first presents a generalized schema to the LLM to establish a foundational understanding, then employs targeted prompting to elicit detailed associations between query phrases and specific database entities, improving the accuracy of mapping queries to database structures without incurring excessive token cost.
- Selective Demonstrations for Cross-domain Text-to-SQL: ODIS proposes the SimSQL method to retrieve additional knowledge from cross-domain databases. SimSQL uses the BM25 algorithm to measure the resemblance in SQL keywords and schema tokens, and selects the top examples from each database as demonstrations that closely align with the target SQL.
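BM25-based demonstration selection of the kind SimSQL performs can be sketched with a minimal, self-contained scorer. This is an illustrative sketch rather than ODIS's implementation: the whitespace tokenizer, the toy demonstration pool, and the function name `bm25_scores` are assumptions; only the BM25 scoring formula itself is standard.

```python
# Demonstration-selection sketch in the spirit of ODIS's SimSQL retrieval:
# score candidate SQL demonstrations against a target SQL with a minimal
# BM25 implementation, then keep the top-scoring examples.
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequencies
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

demos = [
    "SELECT name FROM singer WHERE country = 'US'",
    "SELECT COUNT(*) FROM concert GROUP BY stadium_id",
    "SELECT name FROM stadium ORDER BY capacity DESC LIMIT 1",
]
target = "SELECT name FROM stadium WHERE capacity > 500"
tokenize = lambda s: s.lower().replace("(", " ").replace(")", " ").split()
scores = bm25_scores(tokenize(target), [tokenize(d) for d in demos])
best = demos[max(range(len(demos)), key=scores.__getitem__)]
print(best)
# → SELECT name FROM stadium ORDER BY capacity DESC LIMIT 1
```

Because BM25 downweights tokens that appear in every demonstration (e.g. SELECT, FROM), the ranking is driven by the rarer, more discriminative schema tokens, which is what makes keyword-level retrieval effective for picking demonstrations aligned with the target SQL.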