San Jose, California, United States

3K followers · 500+ connections

About

Ranjan Sinha is an IBM Fellow, Vice President, and Chief Technology Officer for watsonx…


Experience & Education

  • IBM


Volunteer Experience

  • Technical Board, Digital Transformation

    Bio Supply Management Alliance (BSMA)

    - Present 9 months

    Health

    To help build an effective and efficient supply chain digital strategy for the biotech, biopharma, and biomedical device industries by developing and disseminating best practices, knowledge, and research.

  • Member of Board of Directors

    Narika

    - Present 5 years 6 months

    Human Rights

  • Co-organizer

    Bay Area Search Monthly Meetup

    - 4 years

    Science and Technology

    http://www.meetup.com/Bay-Area-Search/

  • Advisory Council

    Harvard Business Review

    - Present 5 years

    Science and Technology

Publications

  • Storage Infrastructure in the AI Era using Tape, HDD, and NAND Flash Memory

    IEEE Transactions on Magnetics

    Magnetic tape, hard disk drives, and NAND flash memory constitute the tiered storage system, also known as the storage pyramid. This review will cover the recent advancements and prospects of these technologies, explore the shifting dynamics of their interaction, and examine the evolving use cases in the AI field. Additionally, it will briefly address the challenges faced by alternative storage technologies such as optical data storage and DNA data storage.

  • NAND, HDD, and Tape Storage Infrastructure for AI

    2024 IEEE 35th Magnetic Recording Conference (TMRC)

    Magnetic tape, hard disk drives, and NAND flash memory constitute the tiered storage system, also known as the storage pyramid. This review will cover the recent advancements and prospects of these technologies, explore the shifting dynamics of their interaction, and examine the evolving use cases in the AI field. Additionally, it will briefly address the challenges faced by alternative storage technologies such as optical data storage and DNA data storage.

  • Book: Digital Marketplaces for Knowledge Intensive Assets

    MC Press Online LLC


    This book on digital marketplaces primarily focuses on Knowledge Intensive Assets (KIAs) such as data, insights, models, digital twins, APIs, software applications, courseware, advertisements, games, and entertainment. KIAs are becoming significant in the world of trading and monetization. Marketplaces dealing with KIAs are still evolving, but they show the promise of a notable growth rate. The steps to manage the trade of KIAs—product validation, acquisition, pricing, and delivery—present challenges that are different from those of traditional commodities. The contents of this book present unique value in today's data- and AI-driven economy, as all industry sectors must handle KIAs. This book provides a detailed look at the challenges and opportunities of marketplaces, especially those dealing with KIAs. As a compendium of marketplace functions, including monetization models and analysis of specific marketplace players, the contents of this book will be beneficial for industry practitioners, researchers, and business executives.

    The topics discussed include:

    - Description of a broad spectrum of KIA products and their emerging market value
    - Challenges in managing the KIA marketplaces
    - Technology trends that support marketplace business
    - The socio-economic impact of digital marketplaces

  • The data dividend: reimagining data strategies to deepen insights

    Economist

    Data is considered the fuel of the digital era. As digitalisation accelerates, driven by a convergence of trends that are advancing the disruptive power of data, businesses are keen to leverage this resource to maximise their value chain. A data strategy that delivers value can open up new sources of competitive advantage, allowing business leaders to identify risks and opportunities and make smarter, better-informed decisions.

  • Accelerating Enterprise Transformation using Data, AI, and Hybrid Cloud

    CIO & CISO Summit, Quartz Event

  • Partnering with the Enterprise Chief Architects to Build a Data-Driven Enterprise

    Global Chief Data & Analytics Officer Exchange

  • Enterprise Transformation using the Power of Data and AI

    Quartz CIO Visions Summit

  • Accelerating Enterprise Data & AI - Turning Data into Business Value

    SYNC 2019 Silicon Valley: Tech For Good

  • Accelerating Enterprise Data & AI

    Silicon Valley Innovation and Entrepreneurship Forum

  • Predictive Analytics for Customer-Centric Commerce at eBay

    Global Predictive Analytics Conference

  • Predictive Analytics for Customer-Centric Commerce at eBay

    [Keynote] IDG Business Impact & Big Data, Seoul

  • Data Science for Customer-Centric Commerce at eBay

    [Plenary] Data Science Innovation Summit, San Diego

  • Data Science on Hadoop for Customer-Centric Commerce at eBay

    [Plenary] Apache Hadoop Innovation Summit, San Diego

  • Predictive Analytics for Customer-Centric Commerce at eBay

    [Keynote] Predictive Analytics Innovation Summit, Chicago

  • Eagle: User Profile-based Anomaly Detection for Securing Hadoop Clusters

    IEEE International Conference on Big Data, Santa Clara

    http://goeagle.io

  • Panel: Key Challenges for Future Big Data to Knowledge (BD2K)

    IEEE International Conference on Big Data, Santa Clara


    Big data and data analytics are among the hottest IT themes in both academia and industry worldwide. In this panel, the panelists will present their points of view on key future challenges for Big Data technologies. The discussion will leverage a diverse set of experiences and viewpoints, since the panel includes participants from the leadership of R&D labs in major corporations and from research groups conducting high-profile Big Data research projects at academic and government organizations.

  • Predictive Analytics for Customer-Centric Commerce at eBay

    Predictive Analytics Innovation Summit, San Diego

  • Data Science Applications with Hadoop at eBay

    Apache Hadoop Innovation Summit, San Diego

  • Astro: A Predictive Model for Anomaly Detection and Feedback-based Scheduling on Hadoop

    IEEE International Conference on Big Data, Santa Clara

  • Building a Personalization Platform with Cassandra

    Cassandra Day Silicon Valley

  • Recommendations and Personalization at eBay

    Invited Talk at Google HQ, Mountain View.

  • Engineering a Scalable, Cache and Space Efficient Trie for Strings

    The VLDB Journal


    Storing and retrieving strings in main memory is a fundamental problem in computer science. The efficiency of string data structures used for this task is of paramount importance for applications such as in-memory databases, text-based search engines and dictionaries. The burst trie is a leading choice for such tasks, as it can provide fast sorted access to strings. The burst trie, however, uses linked lists as substructures which can result in poor use of CPU cache and main memory. Previous research addressed this issue by replacing linked lists with dynamic arrays forming a cache-conscious array burst trie. Though faster, this variant can incur high instruction costs which can hinder its efficiency. Thus, engineering a fast, compact, and scalable trie for strings remains an open problem. In this paper, we introduce a novel and practical solution that carefully combines a trie with a hash table, creating a variant of burst trie called HAT-trie. We provide a thorough experimental analysis which demonstrates that for large sets of strings and on alternative computing architectures, the HAT-trie—and two novel variants engineered to achieve further space-efficiency—is currently the leading in-memory trie-based data structure offering rapid, compact, and scalable storage and retrieval of variable-length strings.

    Other authors: Nikolas Askitis
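
    As a rough illustration of the idea the abstract describes, the sketch below combines a character trie with hash-style buckets at the leaves, bursting a bucket into child nodes once it grows past a threshold. It is a minimal Python toy with assumed names and a deliberately tiny burst limit, not the paper's engineered implementation.

    ```python
    # Toy HAT-trie-style structure: trie nodes route on leading characters,
    # and leaf buckets of suffixes "burst" into child nodes when they grow
    # too large. Threshold and names are illustrative, not from the paper.
    BURST_LIMIT = 4

    class Node:
        def __init__(self):
            self.children = {}       # first char -> child Node
            self.bucket = set()      # suffixes held until this leaf bursts
            self.is_leaf = True
            self.ends_here = False   # a string terminates at this node

        def insert(self, s):
            if self.is_leaf:
                self.bucket.add(s)
                if len(self.bucket) > BURST_LIMIT:
                    self.is_leaf = False
                    for t in self.bucket:   # redistribute one level deeper
                        self.insert(t)
                    self.bucket = set()
            elif s:
                self.children.setdefault(s[0], Node()).insert(s[1:])
            else:
                self.ends_here = True

        def contains(self, s):
            if self.is_leaf:
                return s in self.bucket
            if not s:
                return self.ends_here
            child = self.children.get(s[0])
            return child is not None and child.contains(s[1:])

    root = Node()
    for w in ["trie", "tree", "trip", "hash", "hat", "heap", "sort"]:
        root.insert(w)
    assert root.contains("hat") and not root.contains("ham")
    ```
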
  • Engineering Burstsort: Towards Fast In-place String Sorting

    ACM J. of Experimental Algorithmics


    Burstsort is a trie-based string sorting algorithm that distributes strings into small buckets whose contents are then sorted in cache. This approach has earlier been demonstrated to be efficient on modern cache-based processors [Sinha & Zobel, JEA 2004]. In this article, we introduce improvements that reduce by a significant margin the memory requirement of Burstsort: It is now less than 1% greater than an in-place algorithm. These techniques can be applied to existing variants of Burstsort, as well as to other string algorithms such as those for string management.

    We redesigned the buckets, introducing sub-buckets and an index structure for them, which resulted in an order-of-magnitude space reduction. We also show the practicality of moving some fields from the trie nodes to the insertion point (for the next string pointer) in the bucket; this technique reduces memory usage of the trie nodes by one-third. Importantly, the trade-off for the reduction in memory use is only a very slight increase in the running time of Burstsort on real-world string collections. In addition, during the bucket-sorting phase, the string suffixes are copied to a small buffer to improve their spatial locality, lowering the running time of Burstsort by up to 30%. These memory usage enhancements have enabled the copy-based approach [Sinha et al., JEA 2006] to also reduce the memory usage with negligible impact on speed.

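
    The bucket-distribution scheme summarized above can be sketched in a few lines: grow a trie as buckets overflow, sort each small bucket independently, and walk the trie in order. This is a hedged Python toy with an illustrative burst threshold, not the authors' cache-tuned C implementation.

    ```python
    # Toy burstsort: distribute strings along a dynamically grown trie into
    # small buckets, then sort each bucket in isolation (in cache, in the
    # real algorithm) and concatenate in trie order. The burst threshold is
    # illustrative, not the tuned value from the paper.
    BURST = 3

    def new_node():
        return {"buckets": {}, "children": {}, "done": []}

    def insert(node, s, depth=0):
        if len(s) == depth:           # string fully consumed at this node
            node["done"].append(s)
            return
        c = s[depth]
        if c in node["children"]:
            insert(node["children"][c], s, depth + 1)
            return
        bucket = node["buckets"].setdefault(c, [])
        bucket.append(s)
        if len(bucket) > BURST:       # burst: push this bucket a level deeper
            child = node["children"][c] = new_node()
            for t in node["buckets"].pop(c):
                insert(child, t, depth + 1)

    def traverse(node, out):
        out.extend(node["done"])      # exact prefixes sort first
        for c in sorted(set(node["buckets"]) | set(node["children"])):
            if c in node["children"]:
                traverse(node["children"][c], out)
            else:
                out.extend(sorted(node["buckets"][c]))  # small in-cache sort
        return out

    def burstsort(strings):
        root = new_node()
        for s in strings:
            insert(root, s)
        return traverse(root, [])

    words = ["banana", "apple", "band", "bandana", "app", "ape", "ban"]
    assert burstsort(words) == sorted(words)
    ```
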
  • RepMaestro: scalable repeat detection on disk-based genome sequences

    Bioinformatics Journal


    Motivation: We investigate the problem of exact repeat detection on large genomic sequences. Most existing approaches based on suffix trees and suffix arrays (SAs) are limited either to small sequences or those that are memory resident. We introduce RepMaestro, a software tool that adapts existing in-memory-enhanced SA algorithms to enable them to scale efficiently to large sequences that are disk resident. Supermaximal repeats, maximal unique matches (MuMs) and pairwise branching tandem repeats have been used to demonstrate the practicality of our approach; the first such study to use an enhanced SA to detect these repeats in large genome sequences.

    Results: The detection of supermaximal repeats was observed to be up to two times faster than Vmatch, but more importantly, was shown to scale efficiently to large genome sequences that Vmatch could not process due to memory constraints (4 GB). Similar results were observed for the detection of MuMs, with RepMaestro shown to scale well and also perform up to six times faster than Vmatch. For tandem repeats, RepMaestro was found to be slower but could nonetheless scale to large disk-resident sequences. These results are a significant advance in the quest for scalable repeat detection.

    Other authors: Nikolas Askitis
  • Reducing space requirements for disk resident suffix arrays

    Database Systems for Advanced Applications


    Suffix trees and suffix arrays are important data structures for string processing, providing efficient solutions for many applications involving pattern matching. Recent work by Sinha et al. (SIGMOD 2008) addressed the problem of arranging a suffix array on disk so that querying is fast, and showed that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered many times faster than alternative disk-based suffix trees. A drawback of their LOF-SA structure, and common to all current disk resident suffix tree/array approaches, is that the space requirement of the data structure, though on disk, is large relative to the text – for the LOF-SA, 13n bytes including the underlying n byte text. In this paper we explore techniques for reducing the space required by the LOF-SA. Experiments show these methods cut the data structure to nearly half its original size, without, for large strings that necessitate on-disk structures, any impact on search times.

  • OzSort: Sorting 100GB for less than 87kJoules

    Sort Benchmark (Joulesort)




    OzSort is a fast and stable external sorting software specifically optimized for the requirements of the 2009 PennySort (Indy) benchmark. OzSort can sort over 246GB of data for under a penny, using a standard desktop PC. The data sorted consisted of 100 byte records with (random) keys of length 10 bytes starting from the first byte of each record. In this paper, we apply our OzSort software and its associated hardware components to address the important issue of energy efficiency. By using our software and hardware components, we demonstrate that the power consumption of OzSort on a general-purpose desktop PC can rival that of CoolSort. OzSort could sort 100GB of data in about 827s using less than 87kJoules — 11,597 records sorted/joule. These results exceed those of CoolSort (Daytona class) in 2007, which used mobile processor technology and a RAID of laptop hard drives.

    Other authors: Nikolas Askitis
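
    OzSort's setting is classic external sorting: data far larger than RAM is sorted as in-memory runs and then merged. The sketch below shows that generic run-formation and k-way-merge scaffold in Python; it is the textbook pattern under illustrative parameters, not the OzSort engine or its energy optimizations.

    ```python
    # Generic external sort scaffold: form sorted runs that fit in memory,
    # spill them to temporary files, then k-way merge with a heap.
    import heapq, os, tempfile

    RUN_BYTES = 10_000_000  # per-run memory budget; tuned to RAM in practice

    def external_sort(infile, outfile):
        runs = []
        with open(infile) as f:
            while True:
                lines = f.readlines(RUN_BYTES)    # read roughly one run
                if not lines:
                    break
                lines.sort()                      # in-memory sort of the run
                tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
                tmp.writelines(lines)
                tmp.close()
                runs.append(tmp.name)
        files = [open(r) for r in runs]
        with open(outfile, "w") as out:
            out.writelines(heapq.merge(*files))   # k-way merge of sorted runs
        for f in files:
            f.close()
        for r in runs:
            os.remove(r)
    ```
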
  • OzSort: Sorting over 246GB for a Penny

    Sort Benchmark (Pennysort)


    OzSort is a fast and stable external sorting software that is specifically optimized for the requirements of the PennySort (Indy) benchmark. The software sorts a file containing records of 100 bytes each, with keys of length 10 bytes, where a key starts from the first byte of each record and whose characters are generated uniformly at random. OzSort can sort over 246GB in less than 2150.9 seconds, that is, for less than a penny, and extends the previous record set by psort in 2008 by an additional 56.7GB. We have used a computer that costs US$439.85 (43,985 pennies) and therefore provides a time budget of (3×365×24×3600)/43,985 ≈ 2150.9 seconds per penny. The number of records that could be sorted within this time budget was 2,463,105,024. OzSort can also be extended to cater to the Daytona class and be able to sort records and keys of varying sizes.

    Other authors: Nikolas Askitis
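
    The penny budget above follows from simple arithmetic: the benchmark amortizes the US$439.85 machine (43,985 pennies) over a three-year life, so one penny buys about 2150.9 seconds of machine time.

    ```python
    # Worked check of the PennySort time budget quoted above.
    lifetime_s = 3 * 365 * 24 * 3600          # 94,608,000 seconds in 3 years
    budget_per_penny = lifetime_s / 43_985    # machine price in pennies
    print(round(budget_per_penny, 1))         # 2150.9
    ```
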
  • A Fast Hybrid Short Read Fragment Assembly Algorithm

    Bioinformatics Journal


    The shorter and vastly more numerous reads produced by second-generation sequencing technologies require new tools that can assemble massive numbers of reads in reasonable time. Existing short-read assembly tools can be classified into two categories: greedy extension-based and graph-based. While the graph-based approaches are generally superior in terms of assembly quality, the computer resources required for building and storing a huge graph are very high. In this article, we present Taipan, an assembly algorithm which can be viewed as a hybrid of these two approaches. Taipan uses greedy extensions for contig construction but at each step realizes enough of the corresponding read graph to make better decisions as to how assembly should continue. We show that this approach can achieve an assembly quality at least as good as the graph-based approaches used in the popular Edena and Velvet assembly tools using a moderate amount of computing resources.

  • SHREC: A short-read error correction method

    Bioinformatics Journal


    Motivation: Second-generation sequencing technologies produce a massive amount of short reads in a single experiment. However, sequencing errors can cause major problems when using this approach for de novo sequencing applications. Moreover, existing error correction methods have been designed and optimized for shotgun sequencing. Therefore, there is an urgent need for the design of fast and accurate computational methods and tools for error correction of large amounts of short read data.

    Results: We present SHREC, a new algorithm for correcting errors in short-read data that uses a generalized suffix trie on the read data as the underlying data structure. Our results show that the method can identify erroneous reads with sensitivity and specificity of over 99% and 96% for simulated data with error rates of up to 3% as well as for real data. Furthermore, it achieves an error correction accuracy of over 80% for simulated data and over 88% for real data. These results are clearly superior to previously published approaches. SHREC is available as an efficient open-source Java implementation that allows processing of 10 million short reads on a standard workstation.

  • Improving Suffix Array Locality for Fast Pattern Matching on Disk

    ACM SIGMOD


    The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.

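
    For context, the data structure being laid out on disk answers pattern queries by binary search over lexicographically sorted suffixes. The Python toy below shows that in-memory substrate (it needs Python 3.10+ for bisect's key argument); the paper's disk-resident, trie-fronted blocked layout is not reproduced here.

    ```python
    # In-memory suffix-array pattern matching: binary search locates the
    # contiguous block of sorted suffixes beginning with the pattern.
    # Requires Python 3.10+ for bisect's key= argument.
    import bisect

    def build_suffix_array(text):
        # Quadratic toy construction; real systems use O(n log n) or better.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find_occurrences(text, sa, pattern):
        key = lambda i: text[i:i + len(pattern)]
        lo = bisect.bisect_left(sa, pattern, key=key)
        hi = bisect.bisect_right(sa, pattern, key=key)
        return sorted(sa[lo:hi])

    text = "banana"
    sa = build_suffix_array(text)
    assert find_occurrences(text, sa, "ana") == [1, 3]
    ```
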
  • Detection of near-duplicate images for web search

    6th ACM international conference on Image and video retrieval


    Among the vast numbers of images on the web are many duplicates and near-duplicates, that is, variants derived from the same original image. Such near-duplicates appear in many web image searches and may represent infringements of copyright or indicate the presence of redundancy. While methods for identifying near-duplicates have been investigated, there has been no analysis of the kinds of alterations that are common on the web or evaluation of whether real cases of near-duplication can in fact be identified. In this paper we use popular queries and a commercial image search service to collect images that we then manually analyse for instances of near-duplication. We show that such duplication is indeed significant, but that not all kinds of image alteration explored in previous literature are evident in web data. Removal of near-duplicates from a collection is impractical, but we propose that they be removed from sets of answers. We evaluate our technique for automatic identification of near duplicates during query evaluation and show that it has promise as an effective mechanism for management of near-duplication in practice.

  • SICO: A System for Detection of Near-Duplicate Images During Search.

    International Conference on Multimedia and Expo (ICME)




    Duplicate and near-duplicate digital image matching is beneficial for image search in terms of collection management, digital content protection, and search efficiency. In this paper, we introduce SICO, a novel system for near-duplicate image detection during web search. It accurately detects near-duplicates in the answers returned by commercial image search engines in real-time. We show that SICO — which utilizes PCA-SIFT local descriptors and adapts near-duplicate text document detection techniques — is both effective and efficient. On a standard desktop personal computer, SICO identifies clusters of near-duplicate images with 93% accuracy in under 30 seconds, for an average of 622 returned images for each query.

  • Using redundant bit vectors for near-duplicate image detection

    Advances in Databases: Concepts, Systems and Applications


    Images are amongst the most widely proliferated form of digital information due to affordable imaging technologies and the Web. In such an environment, the use of digital watermarking for image copyright infringement detection is a challenge. For such tasks, near-duplicate image detection is increasingly attractive due to its ability to automate content analysis; moreover, the application domain also extends to data management. The application of PCA-SIFT features and Locality-Sensitive Hashing (LSH) — for indexing and retrieval — has been shown to be highly effective for this task. In this work, we prune the number of PCA-SIFT features and introduce a modified Redundant Bit Vector (RBV) index. This is the first application of the RBV index that shows near-perfect effectiveness. Using the best parameters of our RBV approach, we observe an average recall and precision of 91% and 98%, respectively, with query response time of under 10 seconds on a collection of 20,000 images. Compared to the baseline (the LSH index), the query response times and index size of the RBV index are 12 times faster and 126 times smaller, respectively. As compared to brute-force sequential scan, the RBV index rapidly reduces the search space to 1/80.

  • Pruning SIFT for scalable near-duplicate image matching

    Australasian Database Conference


    The detection of image versions from large image collections is a formidable task as two images are rarely identical. Geometric variations such as cropping, rotation, and slight photometric alteration are unsuitable for content-based retrieval techniques, whereas digital watermarking techniques have limited application for practical retrieval. Recently, the application of Scale Invariant Feature Transform (SIFT) interest points to this domain has shown high effectiveness, but scalability remains a problem due to the large number of features generated for each image. In this work, we show that for this application domain, the SIFT interest points can be dramatically pruned to effect large reductions in both memory requirements and query run-time, with almost negligible loss in effectiveness. We demonstrate that, unlike the original SIFT features, the pruned features scale better for collections containing hundreds of thousands of images.

  • HAT-trie: a cache-conscious trie-based data structure for strings

    Thirtieth Australasian conference on Computer science


    Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces space by collapsing trie-chains into buckets. This is not, however, a cache-conscious approach and can lead to poor performance on current processors. In this paper, we introduce the HAT-trie, a cache-conscious trie-based data structure that is formed by carefully combining existing components. We evaluate performance using several real-world datasets and against other high-performance data structures. We show strong improvements in both time and space; in most cases approaching that of the cache-conscious hash table. Our HAT-trie is shown to be the most efficient trie-based data structure for managing variable-length strings in-memory while maintaining sort order.

    Other authors: Nikolas Askitis
  • Discovery of image versions in large collections

    Advances in Multimedia Modeling


    Image collections may contain multiple copies, versions, and fragments of the same image. Storage or retrieval of such duplicates and near-duplicates may be unnecessary and, in the context of collections derived from the web, their presence may represent infringements of copyright. However, identifying image versions is a challenging problem, as they can be subject to a wide range of digital alterations, and is potentially costly as the number of image pairs to be considered is quadratic in collection size. In this paper, we propose a method for finding the pairs of near-duplicates based on manipulation of an image index. Our approach is an adaptation of a robust object recognition technique and a near-duplicate document detection algorithm to this application domain. We show that this method requires only moderate computing resources, and is highly effective at identifying pairs of near-duplicates.

  • Clustering Near-duplicate Images in Large Collections

    Multimedia Information Retrieval


    Near-duplicate images introduce problems of redundancy and copyright infringement in large image collections. The problem is acute on the web, where appropriation of images without acknowledgment of source is prevalent. In this paper, we present an effective clustering approach for near-duplicate images, using a combination of techniques from invariant image local descriptors and an adaptation of near-duplicate text-document clustering techniques; we extend our earlier approach of near-duplicate image pairwise identification for this clustering approach. We demonstrate that our clustering approach is highly effective for collections of up to a few hundred thousand images. We also show — via experimentation with real examples — that our approach presents a viable solution for clustering near-duplicate images on the Web.

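
    A hedged sketch of the clustering recipe described above: the papers use invariant local descriptors (PCA-SIFT) and text near-duplicate techniques, while this toy abstracts each image as a set of quantized feature IDs, links pairs whose Jaccard overlap clears a threshold, and groups linked images with union-find. Names and the threshold are illustrative.

    ```python
    # Toy near-duplicate clustering: images with strongly overlapping sets
    # of quantized feature IDs are linked, and linked images are grouped
    # with union-find. The 0.5 threshold is illustrative.
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def cluster_near_duplicates(features, threshold=0.5):
        ids = list(features)
        parent = {i: i for i in ids}

        def find(x):                       # union-find with path halving
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if jaccard(features[a], features[b]) >= threshold:
                    parent[find(a)] = find(b)

        groups = {}
        for i in ids:
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    imgs = {"photo": {1, 2, 3, 4}, "crop": {1, 2, 3}, "other": {8, 9}}
    print(cluster_near_duplicates(imgs))   # [['photo', 'crop'], ['other']]
    ```
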
  • Detection of Image Versions For Web Search

    ACM International Conference on Image and Video Retrieval

  • Cache-efficient string sorting using copying

    ACM J. of Experimental Algorithmics


    Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in out-of-cache references. In this paper, we introduce C-burstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CP-burstsort, uses more memory, but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings.

  • Using random sampling to build approximate tries for efficient string sorting

    ACM Journal of Experimental Algorithmics (JEA)


    Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string-sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache efficient. Burstsort dynamically builds a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce, by up to 37%, cache misses further than did the original burstsort, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.

  • Using compact tries for cache-efficient sorting of integers

    Workshop of Experimental and Efficient Algorithms (WEA)


    The increasing latency between memory and processor speeds has made it imperative for algorithms to reduce expensive accesses to main memory. In earlier work, we presented cache-conscious algorithms for sorting strings, that have been shown to be almost two times faster than the previous algorithms, mainly due to better usage of the cache. In this paper, we propose two new algorithms, Burstsort and MEBurstsort, for sorting large sets of integer keys. Our algorithms use a novel approach for sorting integers, by dynamically constructing a compact trie which is used to allocate the keys to containers. These keys are then sorted within the cache. The new algorithms are simple, fast and efficient. We compare them against the best existing algorithms using several collections and data sizes. Our results show that MEBurstsort is up to 3.5 times faster than memory-tuned quicksort for 64-bit keys and up to 2.5 times faster for 32-bit keys. For 32-bit keys, on 10 of the 11 collections used, MEBurstsort was the fastest, whereas for 64-bit keys, it was the fastest for all collections.

  • Using random sampling to build approximate tries for efficient string sorting

    Workshop of Experimental and Efficient Algorithms (WEA)


    Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.

  • Cache-conscious sorting of large sets of strings with dynamic tries

    ACM J. of Experimental Algorithmics


    Ongoing changes in computer architecture are affecting the efficiency of string-sorting algorithms. The size of main memory in typical computers continues to grow but memory accesses require increasing numbers of instruction cycles, which is a problem for the most efficient of the existing string-sorting algorithms as they do not utilize cache well for large data sets. We propose a new sorting algorithm for strings, burstsort, based on dynamic construction of a compact trie in which strings are kept in buckets. It is simple, fast, and efficient. We experimentally explore key implementation options and compare burstsort to existing string-sorting algorithms on large and small sets of strings with a range of characteristics. These experiments show that, for large sets of strings, burstsort is almost twice as fast as any previous algorithm, primarily due to a lower rate of cache misses.

  • Efficient trie-based sorting of large sets of strings

    26th Australasian computer science conference


    Sorting is a fundamental algorithmic task. Many general-purpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a trie-based algorithm for sorting strings, is for large data sets more efficient than all previous algorithms for this task. In this paper we re-evaluate some of the implementation details of burstsort, in particular the method for managing buckets held at leaves. We show that better choice of data structures further improves the efficiency, at a small additional cost in memory. For sets of around 30,000,000 strings, our improved burstsort is nearly twice as fast as the previous best sorting algorithm.

  • E-commerce Personalization at Scale

    23rd International Conference on Information and Knowledge Management (CIKM), Shanghai

Patents

  • Generating personalized user recommendations using word vectors

    Issued 11176145

    In various example embodiments, a system and method for constructing and scoring word vectors between natural language words and generating output to a user in the form of personalized recommendations are presented.

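
    In the spirit of the abstract, the sketch below scores candidate items for a user by cosine similarity between word vectors. The tiny hand-written vectors and function names are assumptions for illustration; the patented system learns vectors from data and scores at scale.

    ```python
    # Toy word-vector recommendation: rank candidates by their best cosine
    # similarity to any of the user's interest terms. Vectors are hand-made
    # stand-ins; real systems learn embeddings from user and item text.
    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms if norms else 0.0

    WORD_VECS = {
        "camera": [0.9, 0.1, 0.0],
        "lens":   [0.8, 0.2, 0.1],
        "guitar": [0.1, 0.9, 0.2],
    }

    def recommend(user_terms, candidates, top_k=2):
        score = lambda item: max(cosine(WORD_VECS[t], WORD_VECS[item])
                                 for t in user_terms)
        return sorted(candidates, key=score, reverse=True)[:top_k]

    print(recommend(["camera"], ["lens", "guitar"]))  # ['lens', 'guitar']
    ```
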
  • Personalization Platform

    Issued US 20190057158


    A personalization system includes a user events module configured to receive a plurality of user events, each user event of the plurality of user events including one or more of a transactional event and a behavioral event associated with the online user, and a personalization cluster including a plurality of personalization servers, each personalization server of the plurality of personalization servers configured to receive a personalization request from a requesting system, the personalization request including a plurality of intermediate results identified by the requesting system, each intermediate result representing a possible outcome that may be presented by the requesting system to the online user, compute a score for each intermediate result of the plurality of intermediate results based at least in part on the plurality of user events, thereby generating a plurality of scores, and return the plurality of scores to the requesting system.

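
    The claim describes a request/response flow: a requesting system sends candidate intermediate results, and a personalization server returns one score per candidate computed from the user's transactional and behavioral events. A minimal sketch of that flow, with assumed event weights and field names:

    ```python
    # Toy personalization scorer: aggregate a user's transactional and
    # behavioral events into per-category affinities, then score each
    # candidate intermediate result. Weights and fields are illustrative.
    EVENT_WEIGHT = {"purchase": 3.0, "view": 1.0}

    def score_candidates(user_events, candidates):
        """user_events: (event_type, category) pairs; candidates: dicts
        with a 'category' field. Returns one score per candidate."""
        affinity = {}
        for event_type, category in user_events:
            affinity[category] = affinity.get(category, 0.0) + EVENT_WEIGHT[event_type]
        return [affinity.get(c["category"], 0.0) for c in candidates]

    events = [("purchase", "shoes"), ("view", "shoes"), ("view", "hats")]
    items = [{"id": 1, "category": "shoes"}, {"id": 2, "category": "hats"}]
    print(score_candidates(events, items))  # [4.0, 1.0]
    ```
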

Courses

  • Effective Executive Speaking by American Management Association, 2015

  • Generative AI: Foundation Models and Platforms (IBM and Coursera)

  • Generative AI: Introduction and Applications

  • Generative AI: Prompt Engineering Basics (IBM and Coursera)

  • IBM Quantum Conversations Badge, 2020

Honors & Awards

  • IBM Champion Learner Gold

    IBM

    Top 4% of all learners at IBM in 2024.

  • IBM Tech 2023

    IBM

    This inaugural award recognizes the accomplishments of IBM's premier technical contributors and is a testament to their hard work throughout 2022 to drive innovation, transform culture, and accelerate growth.

  • IBM Fellow

    IBM

    IBM Fellow is the highest honor a scientist, engineer, or programmer at IBM can achieve, appointed by the CEO. As of April 2022, only 331 IBMers had earned the IBM Fellow distinction.

  • Panel: Adapting New Skills for the 4th Industrial Revolution | Coursera


    Andrew Ng (Coursera Founder), Jeff Maggioncalda (Coursera CEO), Emily Glassberg Sands (Head of Data Science), and Ranjan Sinha (CTO and VP for IBM Global Chief Data Office) discuss with Bay Area leaders how forward-looking companies are preparing to educate and adapt new skills for the “4th Industrial Revolution.”

    Topics include:
    - The critical need for data literacy
    - Evolving your talent to harness the data revolution
    - Developing new skills to capitalize on the data revolution
    - Group discussion on challenges and insights

  • Panel: Data Drives Efficiency

    Data Business Congress - IAM and GDR

    Creating value in your business practices and processes from data – and the mechanisms you need to put in place to ensure that you can do so as effectively as possible.
    · Identifying the good stuff
    · Understanding the options
    · Working with third party advisers

  • IBM Super Learner

    IBM

    Top 10% of learners at IBM.

  • Top Alumni

    RMIT University

    Profiled on the RMIT Alumni Celebration Wall among 10 notable alumni.

  • Critical Talent Award

    eBay Inc.

    eBay's prestigious critical talent award is given to the top 1% of performers.

  • Patent Award

    eBay Inc.

  • Engineering Quality Award

    eBay Inc.

    Awarded by eBay's CEO, John Donahoe.

  • Poster Award

    eBay Inc.

    Thanks to the entire team and our dynamic research intern, Puja Das.

  • Joulesort Benchmark Medal

    Sort Benchmark

    Sorted 100GB of data in about 827s using less than 87kJoules — 11,597 records sorted/joule. The sort benchmarks, started in 1985, were defined, sponsored, and administered by Jim Gray (Turing Award winner).

  • Pennysort Benchmark Medal

    Sort Benchmark

    Sorted over 246GB in less than 2150.9 seconds, that is, for less than a penny, extending the previous record by an additional 56.7GB. The sort benchmarks, started in 1985, were defined, sponsored, and administered by Jim Gray (Turing Award winner).

  • Early Career Researcher Grant

    University of Melbourne

    Chief Investigator for "System for Large Scale Sequence Analysis in Genomic Databases".

  • ARC Discovery Project Fellowship

    Australian Research Council

    Chief Investigator for "In-memory Sorting, Searching and Indexing on Modern Multi-core Cache-based and Graphics Processor Architectures". The Discovery Projects scheme provides federal funding for excellent fundamental research projects and is highly coveted and competitive.

  • Academic Excellence in the Doctor of Philosophy (PhD) Program

    RMIT University (Microsoft prize)

  • Asia-Pacific Young Inventors Awards

    Asian Wall Street Journal and Hewlett-Packard

    Appeared in a centre-page article on Asia's Cutting-Edge Crusaders. Organized by the Asian Wall Street Journal and Hewlett-Packard.

  • Academic Excellence in Master of Technology

    RMIT University

  • Quality Award

    eBay Inc.

  • Shining Star Award

    eBay Inc.

  • Spot Award

    eBay Inc.

Test Scores

  • GRE Quantitative Reasoning (Mathematics)

    Score: 800/800

Organizations

  • Bio Supply Management Alliance

    Technical Board, Digital Transformation

    - Present

    To help build an effective and efficient supply chain strategy for the biotech, biopharma, and biomedical device industries by developing and disseminating best practices, knowledge, and research. https://biosupplyalliance.com

  • IBM Academy of Technology

    Member

    - Present

    https://www.ibm.com/blogs/academy-of-technology/

  • Association for Computing Machinery (ACM)

    Member

    - Present
  • IEEE

    Member

    - Present
  • IEEE Computer Society Technical Committee on Data Engineering

    Member

    - Present
  • Citizens' Climate Lobby

    Member

    - Present
  • Forbes Technology Council

    Member

    - Present
  • Business Performance Innovation Network

    Member

    - Present
