Experience & Education
Volunteer Experience
-
Technical Board, Digital Transformation
Bio Supply Management Alliance (BSMA)
- Present 9 months
Health
To help build an effective and efficient supply chain digital strategy for the biotech, biopharma, and biomedical device industries by developing and disseminating best practices, knowledge, and research.
-
Co-organizer
Bay Area Search Monthly Meetup
- 4 years
Science and Technology
http://www.meetup.com/Bay-Area-Search/
-
Publications
-
Storage Infrastructure in the AI Era using Tape, HDD, and NAND Flash Memory
IEEE Transactions on Magnetics
Magnetic tape, hard disk drives, and NAND flash memory constitute the tiered storage system, also known as the storage pyramid. This review will cover the recent advancements and prospects of these technologies, explore the shifting dynamics of their interaction, and examine the evolving use cases in the AI field. Additionally, it will briefly address the challenges faced by alternative storage technologies such as optical data storage and DNA data storage.
-
NAND, HDD, and Tape Storage Infrastructure for AI
2024 IEEE 35th Magnetic Recording Conference (TMRC)
Magnetic tape, hard disk drives, and NAND flash memory constitute the tiered storage system, also known as the storage pyramid. This review will cover the recent advancements and prospects of these technologies, explore the shifting dynamics of their interaction, and examine the evolving use cases in the AI field. Additionally, it will briefly address the challenges faced by alternative storage technologies such as optical data storage and DNA data storage.
-
Book: Digital Marketplaces for Knowledge Intensive Assets
MC Press Online LLC
This book on digital marketplaces primarily focuses on Knowledge Intensive Assets (KIAs) such as data, insights, models, digital twins, APIs, software applications, courseware, advertisements, games, and entertainment. KIAs are becoming significant in the world of trading and monetization. Marketplaces dealing with KIAs are still evolving, but they show the promise of a notable growth rate. The steps to manage the trade of KIAs—product validation, acquisition, pricing, and delivery—present challenges that are different from those of traditional commodities. The contents of this book present unique value in today's data- and AI-driven economy as all industry sectors are having to handle KIAs. This book provides a detailed look at the challenges and opportunities of marketplaces, especially those dealing with KIAs. As a compendium of marketplace functions, including monetization models and analysis of specific marketplace players, the contents of this book will be beneficial for industry practitioners, researchers, and business executives.
The topics discussed include:
- Description of a broad spectrum of KIA products and their emerging market value
- Challenges in managing the KIA marketplaces
- Technology trends that support marketplace business
- The socio-economic impact of digital marketplaces
-
The data dividend: reimagining data strategies to deepen insights
Economist
Data is considered the fuel of the digital era. As digitalisation accelerates resulting from a convergence of trends that are advancing the disruptive power of data, businesses are keen to leverage this resource to maximise their value chain. A data strategy that delivers value can open up new sources of competitive advantage, allowing business leaders to identify risks and opportunities and make smarter, better-informed decisions.
-
Accelerating Enterprise Transformation using Data, AI, and Hybrid Cloud
CIO & CISO Summit, Quartz Event
-
Partnering with the Enterprise Chief Architects to Build a Data-Driven Enterprise
Global Chief Data & Analytics Officer Exchange
-
Accelerating Enterprise Transformation using Data, AI, and Hybrid Cloud
KGLOBAL@Silicon Valley 2021
-
Enterprise Transformation using the Power of Data and AI
Quartz CIO Visions Summit
-
Accelerating Enterprise Data & AI - Turning Data into Business Value
SYNC 2019 Silicon Valley: Tech For Good
-
Accelerating Enterprise Data & AI
Silicon Valley Innovation and Entrepreneurship Forum
-
Predictive Analytics for Customer-Centric Commerce at eBay
Global Predictive Analytics Conference
-
Predictive Analytics for Customer-Centric Commerce at eBay
[Keynote] IDG Business Impact & Big Data, Seoul
-
Data Science for Customer-Centric Commerce at eBay
[Plenary] Data Science Innovation Summit, San Diego
-
Data Science on Hadoop for Customer-Centric Commerce at eBay
[Plenary] Apache Hadoop Innovation Summit, San Diego
-
Predictive Analytics for Customer-Centric Commerce at eBay
[Keynote] Predictive Analytics Innovation Summit, Chicago
-
Eagle: User Profile-based Anomaly Detection for Securing Hadoop Clusters
IEEE International Conference on Big Data, Santa Clara
-
Panel: Key Challenges for Future Big Data to Knowledge (BD2K)
IEEE International Conference on Big Data, Santa Clara
Big data and data analytics is one of the hottest IT themes in both academics and industry worldwide. In this panel, the panelists will present their point of view on key future challenges for Big Data technologies. The discussion will leverage a diverse set of experiences and viewpoints, since the panel includes participants from both the leadership of R & D labs in major corporations and from research groups conducting high-profile, Big Data research projects at academic and government organizations.
-
Advanced Hadoop Cluster Management through Predictive Modeling
Interview with KDnuggets
-
Predictive Analytics for Customer-Centric Commerce at eBay
Predictive Analytics Innovation Summit, San Diego
-
Data Science Applications with Hadoop at eBay
Apache Hadoop Innovation Summit, San Diego
-
Astro: A Predictive Model for Anomaly Detection and Feedback-based Scheduling on Hadoop
IEEE International Conference on Big Data, Santa Clara
-
Building a Personalization Platform with Cassandra
Cassandra Day Silicon Valley
-
Recommendations and Personalization at eBay
Invited Talk at Google HQ, Mountain View.
-
Engineering a Scalable, Cache and Space Efficient Trie for Strings
The International VLDB Journal
Storing and retrieving strings in main memory is a fundamental problem in computer science. The efficiency of string data structures used for this task is of paramount importance for applications such as in-memory databases, text-based search engines and dictionaries. The burst trie is a leading choice for such tasks, as it can provide fast sorted access to strings. The burst trie, however, uses linked lists as substructures which can result in poor use of CPU cache and main memory. Previous research addressed this issue by replacing linked lists with dynamic arrays forming a cache-conscious array burst trie. Though faster, this variant can incur high instruction costs which can hinder its efficiency. Thus, engineering a fast, compact, and scalable trie for strings remains an open problem. In this paper, we introduce a novel and practical solution that carefully combines a trie with a hash table, creating a variant of burst trie called HAT-trie. We provide a thorough experimental analysis which demonstrates that for large set of strings and on alternative computing architectures, the HAT-trie—and two novel variants engineered to achieve further space-efficiency—is currently the leading in-memory trie-based data structure offering rapid, compact, and scalable storage and retrieval of variable-length strings.
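The bucketed-trie idea underlying the HAT-trie can be sketched in a few lines. The following is an illustrative toy burst trie with plain list buckets, not the paper's cache-conscious structure (which replaces such buckets with hash-table buckets); `BUCKET_LIMIT` is a made-up threshold.

```python
# Toy burst trie: nodes route on the leading character; small list buckets
# hold the remaining suffixes and "burst" into child nodes when they grow
# too large. The HAT-trie refines this by swapping the list buckets for
# cache-friendly hash buckets.
BUCKET_LIMIT = 4  # illustrative; real implementations use far larger buckets

class BurstTrie:
    def __init__(self):
        self.children = {}  # first char -> list bucket of suffixes, or child BurstTrie

    def insert(self, s):
        head, tail = s[:1], s[1:]
        child = self.children.get(head)
        if child is None:
            self.children[head] = [tail]              # start a new bucket
        elif isinstance(child, BurstTrie):
            child.insert(tail)                        # descend into trie node
        else:
            child.append(tail)
            if head and len(child) > BUCKET_LIMIT:    # burst: promote bucket to a node
                node = BurstTrie()
                for suffix in child:
                    node.insert(suffix)
                self.children[head] = node

    def sorted_strings(self, prefix=""):
        # In-order traversal yields the stored strings in sorted order,
        # which is the "fast sorted access" property the abstract refers to.
        out = []
        for head in sorted(self.children):
            child = self.children[head]
            if isinstance(child, BurstTrie):
                out.extend(child.sorted_strings(prefix + head))
            else:
                out.extend(prefix + head + suffix for suffix in sorted(child))
        return out
```

Only the small buckets are ever sorted, so most work touches memory regions small enough to stay cache-resident.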
-
Engineering Burstsort: Towards Fast In-place String Sorting
ACM J. of Experimental Algorithmics
Burstsort is a trie-based string sorting algorithm that distributes strings into small buckets whose contents are then sorted in cache. This approach has earlier been demonstrated to be efficient on modern cache-based processors [Sinha & Zobel, JEA 2004]. In this article, we introduce improvements that reduce by a significant margin the memory requirement of Burstsort: It is now less than 1% greater than an in-place algorithm. These techniques can be applied to existing variants of Burstsort, as well as other string algorithms such as for string management.
We redesigned the buckets, introducing sub-buckets and an index structure for them, which resulted in an order-of-magnitude space reduction. We also show the practicality of moving some fields from the trie nodes to the insertion point (for the next string pointer) in the bucket; this technique reduces memory usage of the trie nodes by one-third. Importantly, the trade-off for the reduction in memory use is only a very slight increase in the running time of Burstsort on real-world string collections. In addition, during the bucket-sorting phase, the string suffixes are copied to a small buffer to improve their spatial locality, lowering the running time of Burstsort by up to 30%. These memory usage enhancements have enabled the copy-based approach [Sinha et al., JEA 2006] to also reduce the memory usage with negligible impact on speed.
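The distribution step that Burstsort is built on can be sketched as a toy recursion: route each string one character deeper until a bucket is small enough to sort in cache. This is an illustration of the idea only, not the engineered algorithm with its sub-buckets, index structures, and copying optimizations; `BUCKET_LIMIT` is a made-up threshold.

```python
BUCKET_LIMIT = 8  # illustrative; the real algorithm tunes bucket size to the CPU cache

def burstsort(strings, depth=0):
    # Small buckets are sorted directly, standing in for the in-cache sort.
    if len(strings) <= BUCKET_LIMIT:
        return sorted(strings)
    exhausted, buckets = [], {}
    for s in strings:
        if len(s) <= depth:
            exhausted.append(s)  # string equals the shared prefix: sorts first
        else:
            buckets.setdefault(s[depth], []).append(s)
    out = sorted(exhausted)
    for ch in sorted(buckets):   # in-order traversal of the implicit trie
        out.extend(burstsort(buckets[ch], depth + 1))
    return out
```

Each recursive level plays the role of a trie node; the paper's engineering concerns making those nodes and buckets compact enough to be nearly in-place.
-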
RepMaestro: scalable repeat detection on disk-based genome sequences
Bioinformatics Journal
Motivation: We investigate the problem of exact repeat detection on large genomic sequences. Most existing approaches based on suffix trees and suffix arrays (SAs) are limited either to small sequences or those that are memory resident. We introduce RepMaestro, a software that adapts existing in-memory-enhanced SA algorithms to enable them to scale efficiently to large sequences that are disk resident. Supermaximal repeats, maximal unique matches (MuMs) and pairwise branching tandem repeats have been used to demonstrate the practicality of our approach; the first such study to use an enhanced SA to detect these repeats in large genome sequences.
Results: The detection of supermaximal repeats was observed to be up to two times faster than Vmatch, but more importantly, was shown to scale efficiently to large genome sequences that Vmatch could not process due to memory constraints (4 GB). Similar results were observed for the detection of MuMs, with RepMaestro shown to scale well and also perform up to six times faster than Vmatch. For tandem repeats, RepMaestro was found to be slower but could nonetheless scale to large disk-resident sequences. These results are a significant advance in the quest of scalable repeat detection.
-
Reducing space requirements for disk resident suffix arrays
Database Systems for Advanced Applications
Suffix trees and suffix arrays are important data structures for string processing, providing efficient solutions for many applications involving pattern matching. Recent work by Sinha et al. (SIGMOD 2008) addressed the problem of arranging a suffix array on disk so that querying is fast, and showed that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered many times faster than alternative disk-based suffix trees. A drawback of their LOF-SA structure, and common to all current disk resident suffix tree/array approaches, is that the space requirement of the data structure, though on disk, is large relative to the text – for the LOF-SA, 13n bytes including the underlying n byte text. In this paper we explore techniques for reducing the space required by the LOF-SA. Experiments show these methods cut the data structure to nearly half its original size, without, for large strings that necessitate on-disk structures, any impact on search times.
-
OzSort: Sorting 100GB for less than 87kJoules
Sort Benchmark (Joulesort)
OzSort is a fast and stable external sorting software specifically optimized for the requirements of the 2009 PennySort (Indy) benchmark. OzSort can sort over 246GB of data for under a penny, using a standard desktop PC. The data sorted consisted of 100 byte records with (random) keys of length 10 bytes starting from the first byte of each record. In this paper, we apply our OzSort software and its associated hardware components to address the important issue of energy efficiency. By using our software and hardware components, we demonstrate that the power consumption of OzSort on a general-purpose desktop PC can rival that of CoolSort. OzSort could sort 100GB of data in about 827s using less than 87kJoules — 11,597 records sorted/joule. These results exceed that of CoolSort (Daytona class) in 2007, which used mobile processor technology and a RAID of laptop hard drives.
-
OzSort: Sorting over 246GB for a Penny
Sort Benchmark (Pennysort)
OzSort is a fast and stable external sorting software that is specifically optimized for the requirements of the PennySort (Indy) benchmark. The software sorts a file containing records of 100 bytes each, with keys of length 10 bytes, where a key starts from the first byte of each record and whose byte-length characters are generated uniformly at random. OzSort can sort over 246GB in less than 2150.9 seconds; that is, for less than a penny, and extends the previous record set by psort in 2008 by an additional 56.7GB. We have used a computer that costs US$439.85, which provides a time budget of 3×365×24×3600 / 43,985 ≈ 2,150.9 seconds. The number of records that could be sorted within the time budget was 2,463,105,024. OzSort can also be extended to cater to the Daytona class and be able to sort records and keys of varying sizes.
-
A Fast Hybrid Short Read Fragment Assembly Algorithm
Bioinformatics Journal
The shorter and vastly more numerous reads produced by second-generation sequencing technologies require new tools that can assemble massive numbers of reads in reasonable time. Existing short-read assembly tools can be classified into two categories: greedy extension-based and graph-based. While the graph-based approaches are generally superior in terms of assembly quality, the computer resources required for building and storing a huge graph are very high. In this article, we present Taipan, an assembly algorithm which can be viewed as a hybrid of these two approaches. Taipan uses greedy extensions for contig construction but at each step realizes enough of the corresponding read graph to make better decisions as to how assembly should continue. We show that this approach can achieve an assembly quality at least as good as the graph-based approaches used in the popular Edena and Velvet assembly tools using a moderate amount of computing resources.
-
SHREC: A short-read error correction method
Bioinformatics Journal
Motivation: Second-generation sequencing technologies produce a massive amount of short reads in a single experiment. However, sequencing errors can cause major problems when using this approach for de novo sequencing applications. Moreover, existing error correction methods have been designed and optimized for shotgun sequencing. Therefore, there is an urgent need for the design of fast and accurate computational methods and tools for error correction of large amounts of short read data.
Results: We present SHREC, a new algorithm for correcting errors in short-read data that uses a generalized suffix trie on the read data as the underlying data structure. Our results show that the method can identify erroneous reads with sensitivity and specificity of over 99% and 96% for simulated data with error rates of up to 3% as well as for real data. Furthermore, it achieves an error correction accuracy of over 80% for simulated data and over 88% for real data. These results are clearly superior to previously published approaches. SHREC is available as an efficient open-source Java implementation that allows processing of 10 million short reads on a standard workstation.
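SHREC's generalized suffix trie is beyond a short sketch, but the signal it exploits — a sequencing error creates rare substrings that a single substitution can turn back into frequent ones — can be illustrated with a simple k-mer spectrum corrector. This is a different, much simpler technique than the paper's; `K` and `MIN_COUNT` are made-up parameters.

```python
from collections import Counter

K, MIN_COUNT = 5, 3  # illustrative parameters, not from the paper

def kmer_counts(reads):
    # Count every length-K substring across all reads.
    c = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            c[r[i:i + K]] += 1
    return c

def correct_read(read, counts):
    # Replace a base only if every k-mer covering it is rare and a single
    # substitution makes all of those k-mers frequent ("solid") again.
    if len(read) < K:
        return read
    for i, base in enumerate(read):
        spans = range(max(0, i - K + 1), min(i, len(read) - K) + 1)
        if all(counts[read[j:j + K]] < MIN_COUNT for j in spans):
            for alt in "ACGT":
                if alt == base:
                    continue
                fixed = read[:i] + alt + read[i + 1:]
                if all(counts[fixed[j:j + K]] >= MIN_COUNT for j in spans):
                    return fixed
    return read
```

A trie-based method like SHREC identifies the same rare-substring signal at subtree level rather than by hashing fixed-length k-mers.
-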
Improving Suffix Array Locality for Fast Pattern Matching on Disk
ACM SIGMOD
The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.
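The paper's contribution is the on-disk arrangement; the in-memory baseline it improves on can be sketched as a plain suffix array queried by binary search. This is illustrative only, with a naive O(n² log n) construction in place of a proper linear-time builder.

```python
def build_suffix_array(text):
    # Naive construction for illustration; production code would use a
    # linear-time algorithm such as SA-IS.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_range(text, sa, pattern):
    # Binary search for the contiguous block of suffixes starting with `pattern`.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                             # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                             # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

def occurrences(text, pattern):
    sa = build_suffix_array(text)
    start, end = find_range(text, sa, pattern)
    return sorted(sa[start:end])  # all positions where `pattern` occurs
```

Each binary-search probe touches a random suffix; on disk that probe is a seek, which is why blocking and a small in-memory trie over block boundaries pay off.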
-
Detection of near-duplicate images for web search
6th ACM international conference on Image and video retrieval
Among the vast numbers of images on the web are many duplicates and near-duplicates, that is, variants derived from the same original image. Such near-duplicates appear in many web image searches and may represent infringements of copyright or indicate the presence of redundancy. While methods for identifying near-duplicates have been investigated, there has been no analysis of the kinds of alterations that are common on the web or evaluation of whether real cases of near-duplication can in fact be identified. In this paper we use popular queries and a commercial image search service to collect images that we then manually analyse for instances of near-duplication. We show that such duplication is indeed significant, but that not all kinds of image alteration explored in previous literature are evident in web data. Removal of near-duplicates from a collection is impractical, but we propose that they be removed from sets of answers. We evaluate our technique for automatic identification of near duplicates during query evaluation and show that it has promise as an effective mechanism for management of near-duplication in practice.
-
SICO: A System for Detection of Near-Duplicate Images During Search.
International Conference on Multimedia and Expo (ICME)
Duplicate and near-duplicate digital image matching is beneficial for image search in terms of collection management, digital content protection, and search efficiency. In this paper, we introduce SICO, a novel system for near-duplicate image detection during web search. It accurately detects near-duplicates in the answers returned by commercial image search engines in real-time. We show that SICO, which utilizes PCA-SIFT local descriptors and adapts near-duplicate text document detection techniques, is both effective and efficient. On a standard desktop personal computer, SICO identifies clusters of near-duplicate images with 93% accuracy in under 30 seconds, for an average of 622 returned images for each query.
-
Using redundant bit vectors for near-duplicate image detection
Advances in Databases: Concepts, Systems and Applications
Images are amongst the most widely proliferated form of digital information due to affordable imaging technologies and the Web. In such an environment, the use of digital watermarking for image copyright infringement detection is a challenge. For such tasks, near-duplicate image detection is increasingly attractive due to its ability of automated content analysis; moreover, the application domain also extends to data management. The application of PCA-SIFT features and Locality-Sensitive Hashing (LSH) for indexing and retrieval has been shown to be highly effective for this task. In this work, we prune the number of PCA-SIFT features and introduce a modified Redundant Bit Vector (RBV) index. This is the first application of the RBV index that shows near-perfect effectiveness. Using the best parameters of our RBV approach, we observe an average recall and precision of 91% and 98%, respectively, with query response time of under 10 seconds on a collection of 20,000 images. Compared to the baseline (the LSH index), the query response times and index size of the RBV index are 12 times faster and 126 times smaller, respectively. As compared to brute-force sequential scan, the RBV index rapidly reduces the search space to 1/80.
-
Pruning SIFT for scalable near-duplicate image matching
Australasian Database Conference
The detection of image versions from large image collections is a formidable task as two images are rarely identical. Geometric variations such as cropping, rotation, and slight photometric alteration are unsuitable for content-based retrieval techniques, whereas digital watermarking techniques have limited application for practical retrieval. Recently, the application of Scale Invariant Feature Transform (SIFT) interest points to this domain has shown high effectiveness, but scalability remains a problem due to the large number of features generated for each image. In this work, we show that for this application domain, the SIFT interest points can be dramatically pruned to effect large reductions in both memory requirements and query run-time, with almost negligible loss in effectiveness. We demonstrate that, unlike the original SIFT features, the pruned features scale better for collections containing hundreds of thousands of images.
-
HAT-trie: a cache-conscious trie-based data structure for strings
Thirtieth Australasian conference on Computer science
Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces space by collapsing trie-chains into buckets. This is not, however, a cache-conscious approach and can lead to poor performance on current processors. In this paper, we introduce the HAT-trie, a cache-conscious trie-based data structure that is formed by carefully combining existing components. We evaluate performance using several real-world datasets and against other high-performance data structures. We show strong improvements in both time and space; in most cases approaching that of the cache-conscious hash table. Our HAT-trie is shown to be the most efficient trie-based data structure for managing variable-length strings in-memory while maintaining sort order.
-
Discovery of image versions in large collections
Advances in Multimedia Modeling
Image collections may contain multiple copies, versions, and fragments of the same image. Storage or retrieval of such duplicates and near-duplicates may be unnecessary and, in the context of collections derived from the web, their presence may represent infringements of copyright. However, identifying image versions is a challenging problem, as they can be subject to a wide range of digital alterations, and is potentially costly as the number of image pairs to be considered is quadratic in collection size. In this paper, we propose a method for finding the pairs of near-duplicates based on manipulation of an image index. Our approach is an adaptation of a robust object recognition technique and a near-duplicate document detection algorithm to this application domain. We show that this method requires only moderate computing resources, and is highly effective at identifying pairs of near-duplicates.
Other authors -
Clustering Near-duplicate Images in Large Collections
Multimedia Information Retrieval
Near-duplicate images introduce problems of redundancy and copyright infringement in large image collections. The problem is acute on the web, where appropriation of images without acknowledgment of source is prevalent. In this paper, we present an effective clustering approach for near-duplicate images, using a combination of techniques from invariant image local descriptors and an adaptation of near-duplicate text-document clustering techniques; we extend our earlier approach of near-duplicate image pairwise identification for this clustering approach. We demonstrate that our clustering approach is highly effective for collections of up to a few hundred thousand images. We also show, via experimentation with real examples, that our approach presents a viable solution for clustering near-duplicate images on the Web.
Other authors -
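As a generic illustration of the pairwise-to-cluster step described in the abstract above, identified near-duplicate pairs can be treated as graph edges and clusters read off as connected components, for example with union-find. This is a simplified sketch of that grouping step under stated assumptions, not the paper's exact algorithm; `cluster_pairs` and the item names are hypothetical.

```python
def cluster_pairs(items, duplicate_pairs):
    """Group items into clusters given near-duplicate pairs (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the root, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Each near-duplicate pair merges two clusters.
    for a, b in duplicate_pairs:
        union(a, b)

    # Collect items by their cluster root.
    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return [sorted(c) for c in clusters.values()]
```

With pairs (a, b) and (b, c), items a, b, and c fall into one cluster and d remains a singleton.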
Detection of Image Versions For Web Search
ACM International Conference on Image and Video Retrieval
-
Cache-efficient string sorting using copying
ACM J. of Experimental Algorithmics
Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in out-of-cache references. In this paper, we introduce C-burstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CP-burstsort, uses more memory, but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings.
Other authors -
Using random sampling to build approximate tries for efficient string sorting
ACM Journal of Experimental Algorithmics (JEA)
Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string-sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache efficient. Burstsort dynamically builds a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce, by up to 37%, cache misses further than did the original burstsort, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.
Other authors -
Using compact tries for cache-efficient sorting of integers
Workshop of Experimental and Efficient Algorithms (WEA)
See publication
The increasing latency between memory and processor speeds has made it imperative for algorithms to reduce expensive accesses to main memory. In earlier work, we presented cache-conscious algorithms for sorting strings, which have been shown to be almost twice as fast as the previous algorithms, mainly due to better usage of the cache. In this paper, we propose two new algorithms, Burstsort and MEBurstsort, for sorting large sets of integer keys. Our algorithms use a novel approach for sorting integers, dynamically constructing a compact trie which is used to allocate the keys to containers. These keys are then sorted within the cache. The new algorithms are simple, fast, and efficient. We compare them against the best existing algorithms using several collections and data sizes. Our results show that MEBurstsort is up to 3.5 times faster than memory-tuned quicksort for 64-bit keys and up to 2.5 times faster for 32-bit keys. For 32-bit keys, on 10 of the 11 collections used, MEBurstsort was the fastest, whereas for 64-bit keys, it was the fastest for all collections.
-
Using random sampling to build approximate tries for efficient string sorting
Workshop of Experimental and Efficient Algorithms (WEA)
Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.
Other authors -
Cache-conscious sorting of large sets of strings with dynamic tries
ACM J. of Experimental Algorithmics
Ongoing changes in computer architecture are affecting the efficiency of string-sorting algorithms. The size of main memory in typical computers continues to grow but memory accesses require increasing numbers of instruction cycles, which is a problem for the most efficient of the existing string-sorting algorithms as they do not utilize cache well for large data sets. We propose a new sorting algorithm for strings, burstsort, based on dynamic construction of a compact trie in which strings are kept in buckets. It is simple, fast, and efficient. We experimentally explore key implementation options and compare burstsort to existing string-sorting algorithms on large and small sets of strings with a range of characteristics. These experiments show that, for large sets of strings, burstsort is almost twice as fast as any previous algorithm, primarily due to a lower rate of cache miss.
Other authors -
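As a rough illustration of the scheme the abstract above describes, here is a minimal Python sketch of the burstsort idea: strings are distributed into small buckets via a shallow trie on their leading characters, an oversized bucket is "burst" into a deeper trie node, and each bucket is sorted independently, so the per-bucket sorts stay small (and, in the real algorithm, cache-resident). This is a simplified toy, not the authors' implementation; `BURST_LIMIT`, `TrieNode`, and `burstsort` are illustrative names, and the tiny burst threshold is for demonstration only.

```python
BURST_LIMIT = 4  # tiny for illustration; real implementations use thousands

class TrieNode:
    def __init__(self, depth):
        self.depth = depth
        self.children = {}   # char -> TrieNode (burst buckets)
        self.buckets = {}    # char -> list of strings (unburst buckets)
        self.exhausted = []  # strings too short to have a char at this depth

    def insert(self, s):
        if len(s) <= self.depth:
            self.exhausted.append(s)
            return
        c = s[self.depth]
        if c in self.children:
            self.children[c].insert(s)
            return
        bucket = self.buckets.setdefault(c, [])
        bucket.append(s)
        if len(bucket) > BURST_LIMIT:
            # Burst: promote the oversized bucket to a deeper trie node.
            child = TrieNode(self.depth + 1)
            for t in bucket:
                child.insert(t)
            self.children[c] = child
            del self.buckets[c]

    def traverse(self, out):
        # In-order traversal yields the strings in sorted order.
        out.extend(sorted(self.exhausted))
        for c in sorted(set(self.children) | set(self.buckets)):
            if c in self.children:
                self.children[c].traverse(out)
            else:
                out.extend(sorted(self.buckets[c]))

def burstsort(strings):
    root = TrieNode(0)
    for s in strings:
        root.insert(s)
    out = []
    root.traverse(out)
    return out
```

Because only small buckets are ever sorted, each sort touches a working set that fits in cache, which is the source of the speedups reported above.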
Efficient trie-based sorting of large sets of strings
26th Australasian computer science conference
Sorting is a fundamental algorithmic task. Many general-purpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a trie-based algorithm for sorting strings, is for large data sets more efficient than all previous algorithms for this task. In this paper we re-evaluate some of the implementation details of burstsort, in particular the method for managing buckets held at leaves. We show that better choice of data structures further improves the efficiency, at a small additional cost in memory. For sets of around 30,000,000 strings, our improved burstsort is nearly twice as fast as the previous best sorting algorithm.
Other authors -
E-commerce Personalization at Scale
23rd International Conference on Information and Knowledge Management (CIKM), Shanghai
Patents
-
Generating personalized user recommendations using word vectors
Issued 11176145
In various example embodiments, a system and method for constructing and scoring word vectors between natural language words and generating output to a user in the form of personalized recommendations are presented.
Other inventors -
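A toy sketch of the general idea in the patent abstract above: score candidate recommendations by cosine similarity between a user's word vector and each item's word vector, then return the top-scoring items. The sparse-dict vector representation and the names `cosine` and `recommend` are illustrative assumptions, not the patented method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user_vec, item_vecs, top_k=3):
    """Rank items by similarity to the user's vector; return the top-k names."""
    scored = [(cosine(user_vec, v), name) for name, v in item_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]
```

For example, a user vector weighted toward "guitar" would rank an item whose vector shares that term above an unrelated one.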
Personalization Platform
Issued US 20190057158
A personalization system includes a user events module configured to receive a plurality of user events, each user event of the plurality of user events including one or more of a transactional event and a behavioral event associated with the online user, and a personalization cluster including a plurality of personalization servers, each personalization server of the plurality of personalization servers configured to receive a personalization request from a requesting system, the personalization request including a plurality of intermediate results identified by the requesting system, each intermediate result representing a possible outcome that may be presented by the requesting system to the online user, compute a score for each intermediate result of the plurality of intermediate results based at least in part on the plurality of user events, thereby generating a plurality of scores, and return the plurality of scores to the requesting system.
Other inventors
Courses
-
Effective Executive Speaking by American Management Association, 2015
-
-
Generative AI: Foundation Models and Platforms (IBM and Coursera)
-
-
Generative AI: Introduction and Applications
-
-
Generative AI: Prompt Engineering Basics (IBM and Coursera)
-
-
IBM Quantum Conversations Badge, 2020
-
Honors & Awards
-
IBM Champion Learner Gold
IBM
Top 4% of all learners at IBM in 2024.
-
IBM Tech 2023
IBM
This is an inaugural award to recognize the accomplishments of IBM’s premier technical contributors and a testament to the hard work throughout 2022 to drive innovation, transform culture and accelerate growth.
-
IBM Fellow
IBM
An IBM Fellow is a position at IBM appointed by the CEO. Fellow is the highest honor a scientist, engineer, or programmer at IBM can achieve. As of April 2022, only 331 IBMers have earned the IBM Fellow distinction.
-
Panel: Adapting New Skills for the 4th Industrial Revolution | Coursera
-
Andrew Ng (Coursera Founder), Jeff Maggioncalda (Coursera CEO), Emily Glassberg Sands (Head of Data Science), and Ranjan Sinha (CTO and VP for IBM Global Chief Data Office) discuss with Bay Area leaders how forward-looking companies are preparing to educate and adapt new skills for the “4th Industrial Revolution.”
Topics include:
-The critical need of data literacy
-Evolving your talent to harness the data revolution
-Developing new skills to capitalize on the data revolution
-Group discussion on challenges and insights -
Panel: Data Drives Efficiency
Data Business Congress - IAM and GDR
Creating value in your business practices and processes from data – and the mechanisms you need to put in place to ensure that you can do so as effectively as possible.
· Identifying the good stuff
· Understanding the options
· Working with third party advisers -
IBM Super Learner
IBM
Top 10% of learners at IBM.
-
Top Alumni
RMIT University
Profiled on the RMIT Alumni Celebration Wall among 10 notable alumni.
-
Critical Talent Award
eBay Inc.
eBay's prestigious critical talent award is given to the top 1% of performers.
-
Patent Award
eBay Inc.
-
Engineering Quality Award
eBay Inc.
Awarded by eBay's CEO, John Donahoe.
-
Poster Award
eBay Inc.
Thanks to the entire team and our dynamic research intern, Puja Das.
-
Joulesort Benchmark Medal
Sort Benchmark
Sorted 100GB of data in about 827 seconds using less than 87 kilojoules, that is, 11,597 records sorted per joule. The sort benchmarks, started in 1985, were defined, sponsored, and administered by Jim Gray (Turing Award winner).
-
Pennysort Benchmark Medal
Sort Benchmark
Sorted over 246GB in less than 2,150.9 seconds, that is, for less than a penny, extending the previous record by an additional 56.7GB. The sort benchmarks, started in 1985, were defined, sponsored, and administered by Jim Gray (Turing Award winner).
-
Early Career Researcher Grant
University of Melbourne
Chief Investigator for "System for Large Scale Sequence Analysis in Genomic Databases".
-
ARC Discovery Project Fellowship
Australian Research Council
Chief Investigator for "In-memory Sorting, Searching and Indexing on Modern Multi-core Cache-based and Graphics Processor Architectures". The Discovery Projects scheme provides federal funding for excellent fundamental research projects and is highly coveted and competitive.
-
Academic Excellence in the Doctor of Philosophy (PhD) Program
RMIT University (Microsoft prize)
-
Asia-Pacific Young Inventors Awards
Asian Wall Street Journal and Hewlett-Packard
Featured in a center-page article on Asia’s Cutting-Edge Crusaders. Organized by the Asian Wall Street Journal and Hewlett-Packard.
-
Academic Excellence in Master of Technology
RMIT University
-
Quality Award
eBay Inc.
-
Shining Star Award
eBay Inc.
-
Spot Award
eBay Inc.
Test Scores
-
GRE Quantitative Reasoning (Mathematics)
Score: 800/800
Organizations
-
Bio Supply Management Alliance
Technical Board, Digital Transformation
- Present
To help build an effective and efficient supply chain strategy for the biotech, biopharma & biomedical device industries by developing and disseminating best practices, knowledge, & research. https://biosupplyalliance.com
-
IBM Academy of Technology
Member
- Present
https://www.ibm.com/blogs/academy-of-technology/
-
Association for Computing Machinery (ACM)
Member
- Present -
IEEE
Member
- Present -
IEEE Computer Society Technical Committee on Data Engineering
Member
- Present -
Citizen's Climate Lobby
Member
- Present -
Forbes Technology Council
Member
- Present -
Business Performance Innovation Network
Member
- Present
More activity by Ranjan
-
“Driving Responsible Approaches to AI Through Operations and Development Workflows - with Ranjan Sinha, Ph.D. of IBM and Tsavo Knott of Pieces for…
Liked by Ranjan Sinha, Ph.D.
-
Look at those beautiful Granite Guardian safety vests! #brand #bootleg The Granite Guardian technical report is now on arXiv:…
Liked by Ranjan Sinha, Ph.D.
-
📢 Come visit us at the IBM Research booth at NeurIPS! Check out our demos, papers, and posters on generative AI and connect with our researchers to…
Liked by Ranjan Sinha, Ph.D.
-
I'm at NeurIPS in Vancouver this week. Swing by the IBM booth and let's connect! I'd love to chat about what's exciting everyone in Gen AI…
Liked by Ranjan Sinha, Ph.D.
-
Unstructured data is at the heart of how businesses build and deploy gen AI solutions, e.g., RAG. Diving into quality filtering and curation of…
Liked by Ranjan Sinha, Ph.D.
-
I am thrilled to announce the relaunch of Virtual Gold, marking an exciting new chapter in my lifelong journey of leveraging data and AI to unlock…
Liked by Ranjan Sinha, Ph.D.
-
Open-source Small Language Models like Granite are becoming increasingly powerful and when customized with a highly efficient methodology like…
Liked by Ranjan Sinha, Ph.D.
-
Last week, I sat down with Ash Jhaveri, Meta’s VP of AI Partnerships, and Clay Shirky, NYU’s Provost on AI Education, at washingtonpost.com Live, to…
Liked by Ranjan Sinha, Ph.D.
-
One year ago, IBM, Meta, and over 50 organizations came together to launch the AI Alliance, a groundbreaking initiative to make AI accessible…
Liked by Ranjan Sinha, Ph.D.
-
Thank you to our colleagues and friends at Dell Technologies, QTS Data Centers and NVIDIA for braving the chilly weather and coming to IBM Research's…
Liked by Ranjan Sinha, Ph.D.
-
This year, I am particularly thankful for the incredible communities I’m privileged to be part of — from colleagues and collaborators in AI and data…
Liked by Ranjan Sinha, Ph.D.
-
If you’re a leader, the future is certainly bright at IBM which has been named #3 on the TIME “The Best Companies for Future Leaders List”! The…
Liked by Ranjan Sinha, Ph.D.