<?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcq="http://purl.org/dc/terms/"><records count="1" morepages="false" start="1" end="1"><record rownumber="1"><dc:product_type>Journal Article</dc:product_type><dc:title>Vector embeddings by sequence similarity and context for improved compression, similarity search, clustering, organization, and manipulation of cDNA libraries</dc:title><dc:creator>Um, Daniel H; Knowles, David A; Kaiser, Gail E</dc:creator><dc:corporate_author/><dc:editor/><dc:description>This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). By assigning a unique vector embedding to each short sequence, it is possible to more efficiently cluster and improve upon compression performance for the string representations of cDNA libraries. Furthermore, by studying alternative coordinate vector embeddings trained on the context of codon triplets, we can demonstrate clustering based on amino acid properties. Employing this sequence embedding method to encode barcodes and cDNA sequences, we can improve the time complexity of similarity searches.
By pairing vector embeddings with an algorithm that determines the vector proximity in Euclidean space, this approach enables quicker and more flexible sequence searches.</dc:description><dc:publisher>Elsevier</dc:publisher><dc:date>2025-02-01</dc:date><dc:nsf_par_id>10569450</dc:nsf_par_id><dc:journal_name>Computational Biology and Chemistry</dc:journal_name><dc:journal_volume>114</dc:journal_volume><dc:journal_issue>C</dc:journal_issue><dc:page_range_or_elocation>108251</dc:page_range_or_elocation><dc:issn>1476-9271</dc:issn><dc:isbn/><dc:doi>https://doi.org/10.1016/j.compbiolchem.2024.108251</dc:doi><dcq:identifierAwardId>2247370; 2313055</dcq:identifierAwardId><dc:subject>Clustering</dc:subject><dc:subject>Compression</dc:subject><dc:subject>Similarity Search</dc:subject><dc:subject>Natural Language Processing</dc:subject><dc:subject>Bioinformatics</dc:subject><dc:subject>Vector Embeddings</dc:subject><dc:version_number/><dc:location/><dc:rights/><dc:institution/><dc:sponsoring_org>National Science Foundation</dc:sponsoring_org></record></records></rdf:RDF>