June 22, 2009

early case assessment - immediate reports and more reports

Author: admin - Categories: Case Assessment, Litigation Holds, MetaData, eDiscovery, filter and cull
Over 40 early case assessment reports from YOUR data

 

The following is a brief introduction to the reports available in earlyCASE.

 

1.       Assumptions – Outlines the assumptions about the data and the cost to process and review it. The assumptions directly affect many of the reports. If you are going to run what-if scenarios with different sets of assumptions, print the assumptions report and attach it to the reports generated from those assumptions.

2.       Matter – This report shows the information you entered on the first screen when you created a new matter. This information can be updated, and it does affect the other reports. The amount of anticipated raw data (ESI) and how long you have to review it are pivotal to the other reports.

3.       File Type Summary – This report shows all of the file types seen in this ESI (including email attachments), a description of each file type, how many there are, the aggregate size of all of the files of this type, and the date range (last modified) of these files. The report is sorted by count with the largest (highest #) first. This report should also be used in conjunction with the Office File Types report.

4.       De-Dup – File Type Summary - (Professional Run Required)  This report is very similar to the File Type Summary, BUT it shows the counts and sizes based on a global deduplication of the documents. It shows all of the file types seen in this ESI (including email attachments), a description of each file type, the de-duplicated count of how many there are, the aggregate size of the de-duplicated files of this type, and the date range (last modified) of these files. The report is sorted by count with the largest (highest #) first. This report also includes totals at the end of the report for count and size.

5.       File Date Summary – This can be a very long report because it shows, by date and time, the number of documents along with the aggregate size of the documents for that date and time. This report is useful when you are looking for a specific document based on a narrow date range.

6.       File Author Summary – This report relies on the metadata in the documents and summarizes by document author how many documents they authored, the aggregate size and the date range. For this report to be useful, the applications installed / configured on the machine which generated the documents must have been set up with the user’s name and not some generic information. If you use this report make sure you validate it, as many machines do not have this information configured and hence it is not in the document metadata.

7.       Duplicates (Custodian) - (Professional Run Required)  This report provides a summary by custodian of the number of duplicate emails and duplicate files. If both MD5 and SHA1 were generated it will show the counts by hash type as well. In addition, this report summarizes across all of the custodians the total number of duplicate emails and duplicate files. For more information on how we determine duplicate emails please refer to this document on our website:   https://www.earlycase.com/resources/earlyCASE%20Detecting%20Duplicate%20Emails.pdf

8.       Duplicates (Global) - (Professional Run Required)  This report provides a summary across all of the custodians of the number of duplicate emails and duplicate files. If both MD5 and SHA1 were generated it will show the counts by hash type as well. In addition, this report lists the duplicate documents with their MD5 hash and the count of how many duplicates there are. This section is sorted by count descending (largest # first) and is useful for spot checking the duplicates as well as seeing which documents were duplicated the most. For more information on how we determine duplicate emails please refer to this document on our website:   https://www.earlycase.com/resources/earlyCASE%20Detecting%20Duplicate%20Emails.pdf

9.       Image Summary – This report shows the types of images (pictures, etc.) along with the aggregate size and date range. Depending on the matter, pictures (images) may not be useful to process, OCR, and review. This report identifies the types and counts to assist in budgeting and planning for them if needed.

10.   Warnings & Errors – Any warnings or errors encountered in the processing of the ESI by earlyCASE will be reported here. It is a good idea to check this report. Password protected files, corrupt files, and unsupported file compression are all reported here.

11.   Budgets & Timelines - This report contains the summary of the amounts of data, applies the assumptions to this data and calculates the budgets and time required to handle the sample analyzed. It also extrapolates what the larger population of data will look like based on the sample. This is an excellent report to use when changing assumptions to see the effect that those changed assumptions have on the project budget. This report should be used in conjunction with the Sizes and Counts Expanded report.

12.   26(f) report – This report is a rollup of 6 of the other reports in earlyCASE. The intent of this report is to provide a snapshot of the data in a matter, the assumptions, etc., all without disclosing anything that would be privileged or confidential. As such, this report is careful to include only summary information useful in communicating and negotiating with the other side in a matter about the ESI / eDiscovery.

13.   Email “To:” summary – This report shows you who the custodians have been sending email “TO”, the counts and the date ranges of those emails. It can be useful in identifying additional custodians as well as in justifying the elimination of a custodian from what is processed and reviewed.

14.   Email “From:” summary - This report shows you who the custodians have been receiving email “FROM”, the counts and the date ranges of those emails. It can be useful in identifying additional custodians as well as in justifying the elimination of a custodian from what is processed and reviewed.

15.   Email Dates summary – Summarizes by year and month the number of emails and their aggregate size. It is useful in determining date cutoffs based on the ESI you have against the request. It allows you to quickly isolate blocks of emails by date that are clearly outside the date range at issue.

16.   Conversation Summary – This report summarizes the number of emails (including responses and forwards) in an email thread as well as the date range of the messages in that thread.    This report should generally NOT be provided to the other side in a matter as the subject lines may contain privileged or confidential information.

17.   Duplicate emails - (Professional Run Required)  This report can be fairly long in a large population of emails. It shows the MD5 hash of the email message, the sent date, the subject and provides totals at the end of the report. This report should generally NOT be provided to the other side in a matter as the subject lines may contain privileged or confidential information. For more information on how we determine duplicate emails please refer to this document on our website:   https://www.earlycase.com/resources/earlyCASE%20Detecting%20Duplicate%20Emails.pdf

18.   PST & NSF Analyzed - (Professional Run Required for NSF)  This report shows a summary of the types of email containers processed, their aggregate size, how many emails were processed from those containers and the aggregate size of the messages. This report also shows, by custodian and container, how many messages there are and the aggregate size of the messages. Lotus Notes (NSF) processing requires a Professional earlyCASE run.

19.   Other EMail Containers – This report identifies less common email container files that are not processed by earlyCASE and that may require additional work to extract and review the messages. The objective of this report is to make you aware that you have some uncommon / less common email mailboxes in the population of data.

20.   Summary of Container Files - (Professional Run Required)  This report summarizes by custodian the containers observed, as well as summarizing the container files by type. It provides a description of each container type along with the number of containers of that type, their aggregate size, and the date range (last modified date) of those containers. Totals of all the containers are also on this report.

21.   Container File Details - (Professional Run Required)  This report summarizes by custodian the containers observed, as well as providing details on what was extracted from each container.

22.   Encase, DD, AFF Image Info - (Professional Run Required)  This report details the drive images seen in the data that was processed by earlyCASE, the size of each image, the date of the image and the number of files that the image contains. It also provides the details about the image: when it was created, by whom, the machine information, and any notes entered by the examiner when the image was created.

23.   Data Collection Summary – This report shows the details on drive images that were created using the new earlyCASE “Collect” feature. This includes the hash values of the images, the machine information, who created the image and when, etc.

24.   Sizes and Counts Expanded – This report provides a summary view of how the data expanded both in size and counts. It breaks this down by run and by piece of media. This report also identifies the number and percentage of items on that media that were duplicates of things already processed. Included on this report is a summary of the containers, what was extracted from the containers and an anticipated size of what you would pay to process and review based on the actual data you analyzed.

25.    Top 25 File Types – Generally the most common file types seen in a population of documents make up the majority of the count as well as the size. This report identifies the most frequent file types, the count, and the aggregate size, along with totals for what the top 25 file types represent in count, size and percentage of the total document population. It is not uncommon to see the top 25 file types represent over 85% of the total population, and it is a good starting place to understand both the loose files and the file types that were attached to emails.

26.   Top 25 File Dates – The top 25 file dates report summarizes the population of documents by year and month and orders the list based on the year / month with the most documents. This is useful in understanding the distribution of the documents by date. It can be useful in seeing how the data you have relates to the dates of interest in the matter.

27.   Office File Types – This report looks at just the “Common” office types of documents against the larger population of documents and provides counts,  and aggregate sizes by file type for the normal Office Document types.   This report should be used in conjunction with the Top 25 file types report to form a picture of what files are predominant and meaningful.

28.   Other Costs and Expenses – This report takes the cost assumptions and applies them to the data that was analyzed to estimate the costs in this area for both the analyzed sample and the larger (extrapolated) set of data. These costs include items like data collection, litigation support, project management, hosting costs, etc.

29.   Backup Tape Costs – Backup tape related costs are reported here and are NOT rolled into the budget for a project as generally backup tape is treated as inaccessible information.   This report identifies these costs based on the assumptions provided.   If backup tapes are to be handled in a matter these costs need to be manually added into your budget.

30.   Extrapolated Costs – This report extrapolates the size and counts of the larger data set size based on the sample that was analyzed.   It applies the filter cull assumptions provided and extrapolates the anticipated costs pre and post filtering for the larger set of data anticipated.

31.   Attorney Review Cost – This report shows the anticipated costs of reviewing the extrapolated set of data. This report uses the assumptions provided to arrive at the cost of attorney review as well as the number of full time attorneys that would be required to complete the review in the time frame specified.

32.   Generally Included File Types – This report compares the file types observed in the data analyzed against a database (Filetypes) to summarize what the sizes and counts of the generally reviewed file types would look like. This differs from the top 25 file types and office file types in that you can customize the file types which show on this report.

33.   Generally Excluded File Types - This report compares the file types observed in the data analyzed against a database (Filetypes) to summarize what the sizes and counts of the file types which are generally NOT processed or reviewed would look like. This differs from the top 25 file types and office file types in that you can customize the file types which show on this report.

34.   Unknown File Types – This report summarizes the file types, counts and sizes for any file type which is unknown / not defined in the FileTypes table. This report should be checked to see if there are file types which the generally included and generally excluded file types reports did not pick up.

35.   Compare Run 1 and Run 2 – This report compares the files and emails analyzed in the first and second run of a matter to summarize the differences between the 2 runs. This is useful when you have 2 images of the same hard drive and you need to understand what (if anything) has changed. It is also helpful for validating a drive image against the source drive to ensure that the image has everything.

36.   Folders > 100 Files – This report shows the path of any folder which has over 100 files in it. This is useful in checking for temporary locations or system storage that may be in the collection of data but really do not need to be processed and reviewed. This report should be used with the Folder Inventory ALL report to form a complete picture of where documents were stored and which folders / paths can be filtered out.

37.   Folder Inventory ALL - This report shows the counts by folders (and path) and is ordered by the folder with the most files to the folders with the least files.  This is useful in understanding where the bulk of the files came from and potentially identifying paths which really do not need to be processed and reviewed.    This report can be pretty long, and you may want to focus on just the first 3 or 4 pages of this report.

38.   File Inventory with MD5 – This report is useful when you intend to turn over a population of native files to another party and you want to provide a complete inventory of what you are giving them which includes the hash values of every document.    This report will be very long!

39.   Charts and Graphs – earlyCASE presents the file types, containers and other key pieces of information in chart and graph form.

40.   File Type Pivot – Pivot tables allow you to select / deselect file types which are then charted or visible in table form. This is useful in seeing the impact on size and count of filtering file types from the population. This is an advanced Microsoft Excel function and is well worth learning more about.

41.   File Date Pivot - Pivot tables allow you to select / deselect date ranges which are then charted or visible in table form. This is useful in seeing the impact on size and count of filtering date ranges from the population. This is an advanced Microsoft Excel function and is well worth learning more about.

42.   Export Email and File tables to CSV – Use this feature to export the data in the Access database to standard CSV files. The Email Details, File Details, and Folder Details are exported with this option.

43.   Microsoft Excel Export - Use this feature to export the data in the Access database to standard Microsoft Excel spreadsheets. The Email Details, File Details, and Folder Details are exported with this option. Be aware that the Excel 97-2003 worksheet format has a limit of 65,536 rows, so larger matters will overflow the limits of Excel and you should use the Export to CSV option instead.

44.   XML Export of Data – Users who have installed a full copy of Microsoft Access 2007 have the option of exporting the data into XML (instead of CSV).   Because XML is very verbose you should only use this if the CSV option will not work for you.
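
To make the idea behind a report like the File Type Summary concrete, here is a minimal sketch in Python. It is NOT how earlyCASE itself is implemented; the function name `file_type_summary` and the choice to group by file extension are assumptions made purely for illustration. It walks a folder, groups files by extension, and records the count, aggregate size and last-modified date range, sorted by count with the largest first.

```python
import os
from collections import defaultdict
from datetime import datetime


def file_type_summary(root):
    """Summarize files under root by extension (hypothetical helper).

    Returns a list of (extension, stats) pairs sorted by count, largest first.
    """
    summary = defaultdict(
        lambda: {"count": 0, "bytes": 0, "oldest": None, "newest": None}
    )
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                info = os.stat(path)
            except OSError:
                # Unreadable file: a real tool would log a warning here.
                continue
            ext = os.path.splitext(name)[1].lower() or "(none)"
            row = summary[ext]
            row["count"] += 1
            row["bytes"] += info.st_size
            modified = datetime.fromtimestamp(info.st_mtime)
            if row["oldest"] is None or modified < row["oldest"]:
                row["oldest"] = modified
            if row["newest"] is None or modified > row["newest"]:
                row["newest"] = modified
    # Highest count first, mirroring the sort order described for the report.
    return sorted(summary.items(), key=lambda item: item[1]["count"], reverse=True)
```

Calling `file_type_summary` on a folder path returns one row per extension, which you could print or write out to CSV. Note that this sketch only sees loose files; it does not look inside email containers or attachments.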

 

 

 

 

Alphabetical List of the currently available reports:

 


26(f) report

Assumptions

Attorney Review Cost

Backup Tape Costs

Budgets & Timelines

Charts and Graphs

Compare Run 1 and Run 2

Container File Details

Conversation Summary

Data Collection Summary

De-Dup – File Type Summary

Duplicate emails

Duplicates (Custodian)

Duplicates (Global)

Email “From:” summary

Email “To:” summary

Email Dates summary

Encase, DD, AFF Image Info

Export Email and File tables to CSV

Extrapolated Costs

File Author Summary

File Date Pivot

File Date Summary

File Inventory with MD5

File Type Pivot

File Type Summary

Folder Inventory ALL

Folders > 100 Files

Generally Excluded File Types

Generally Included File Types

Image Summary

Matter

Microsoft Excel Export

Office File Types

Other Costs and Expenses

Other EMail Containers

PST  & NSF Analyzed

Sizes and Counts Expanded

Summary of Container Files

Top 25 File Dates

Top 25 File Types

Unknown File Types

Warnings & Errors

XML Export of Data


 

 

 

October 7, 2008

eDiscovery - security, risk, privilege and case assessment

Author: admin - Categories: Case Assessment, eDiscovery

earlyCASE uses your local machine to process the documents, and they are NOT transmitted to any other servers or networks. Because of this you can use earlyCASE in situations where privacy or confidentiality concerns exist. In addition to the security benefit of analyzing the data locally, processing is very fast since the documents do not have to be transmitted or copied to a remote server or computer.

Your Account:

When you create an account, it is yours alone. You set the password, maintain your own profile, etc. Because accounts are FREE we encourage users to NOT SHARE their account with anyone. If others within your organization need to use earlyCASE, they can create an account and be up and running in a matter of minutes.

The earlyCASE Application:

earlyCASE is a new class of Internet application which Microsoft calls a “Rich Internet Application” or “ClickOnce Application”. Once you open the application from the www.earlyCASE.com site it is installed and has local access to read and analyze data you browse to with earlyCASE. You can analyze any data with earlyCASE that your computer can read. When you run earlyCASE for the first time, Windows will prompt you with a security message to get your permission to run and install earlyCASE on the machine. The installation process only takes a minute or two.

For more information on applications built using Microsoft’s ClickOnce development methodology

Trusted Site:

It is recommended that you add www.earlyCASE.com to the sites that your browser has listed as trusted. This will apply the Internet security settings for trusted sites when running earlyCASE.

For more information on allowing a site to be a “Trusted Site”

 

Requirements:

For earlyCASE to run you must have the Microsoft .NET Framework 3.0 (or later) installed on your machine. This is a FREE component from Microsoft, and is generally already installed on most PCs. If it is NOT already installed on your PC, click on the following link to install the .NET Framework.

http://www.microsoft.com/downloads/details.aspx?FamilyID=333325FD-AE52-4E35-B531-508D977D32A6&displaylang=en

View Tom. Strack @ earlycase.com's profile on LinkedIn

early case assessment and foreign languages

Author: admin - Categories: Case Assessment, eDiscovery

earlyCASE was built from the ground up to support multiple languages (including Asian languages) and character sets. You will often hear this type of support referred to as Unicode and/or double-byte characters.

Internal to the earlyCASE application(s) as well as the database structure is full support for all single- and double-byte character sets. Analyzing foreign language content is not a problem for earlyCASE.

Where users generally have issues is with the character sets installed on the machine they are running Windows on. Without certain character sets installed, Microsoft Windows does not have the correct characters / logic to display other languages, at least not until you tell the operating system to install these languages.

GOOD PRACTICE:   Install the Language Interface Packs on your machine BEFORE you analyze data which you know or suspect are in a language other than English.  

The reason for installing the language interface packs in advance is actually quite simple: when you are selecting files and folders, document names created and stored in other languages will often look like gibberish. With the languages installed, you may not be able to understand what they say, but you can at least read them.

EVEN if you do not install the language interface packs, earlyCASE will still analyze the documents correctly and you can install the language packs at a later time when you are ready to access the local database that earlyCASE creates  for you.

Some Resources to assist you in getting language packs installed:

·         Asian Languages

·         Languages Supported by Windows

·         Installing Multiple Language Support for your Operating System

·         Installing Language Support for Microsoft Office

The method and location of language installation varies by the version of Windows that you are running. Generally you can find it by opening the Control Panel within Windows and opening “Regional and Language Options” (or a similar option). There will be a Languages tab which allows you to add additional language support to your machine.

Dates and Measures:

The dates and units of measure represented within earlyCASE all use the normalized display used in the United States. Dates are formatted as MM/DD/YYYY. This is required to be able to report across a collection regardless of where the ESI originated.
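
As a small illustration of what normalizing to a single date display looks like, here is a hedged Python sketch. The helper name `to_us_date` and the assumption that the input arrives as a YYYY-MM-DD string are ours for the example; this is not a description of how earlyCASE stores or converts dates internally.

```python
from datetime import datetime


def to_us_date(iso_date):
    """Convert a YYYY-MM-DD string to the MM/DD/YYYY display format."""
    return datetime.strptime(iso_date, "%Y-%m-%d").strftime("%m/%d/%Y")
```

For example, `to_us_date("2009-06-22")` gives `06/22/2009`.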

GOOD PRACTICE:   DO NOT CHANGE your input language,  simply install additional languages to give you access to the associated character sets.


How does early case assessment for eDiscovery work

Author: admin - Categories: Case Assessment, eDiscovery

Q:           What does earlyCASE analyze?

A:            earlyCASE looks at the file metadata as well as the internal metadata within office documents and emails and organizes it into a database and reports. This includes processing Microsoft PSTs, extracting and analyzing email attachments, etc. The information is presented in a form that lets you easily see the dates, types, custodians, etc. related to the data analyzed. The professional analysis builds on top of this by generating hash values for all of the documents and emails, which gives you visibility into duplicate documents, emails and attachments.

Q:           How can you offer this for FREE?

A:            The cost of earlyCASE is partially covered by the contribution of our sponsors. Their message and content (ads) are shown inside of the earlyCASE application while your ESI is being analyzed. Links to these sponsors are shown in various places on the reports that are created.

Q:           What is the “Professional” version?

A:            The professional analysis includes extracting and analyzing the contents of containers (Zip files, etc.), generating hash values for the files, emails, etc., and isolating the documents that are exact duplicates. With the professional analysis there are also 8 additional reports created.

Q:           Can I add data to a matter?

A:            Yes, you have the option of adding data (selecting files and folders) at several places in the earlyCASE application. When you add data, the additional data is analyzed (i.e., run) and the results are added to the database and associated reports for that matter.

Q:           How long does it take to run the Analysis?

A:            Because the documents, emails, etc. NEVER leave your computer, the analysis is very fast. The earlyCASE application comes to the data. The metadata is read and added to the database; files are not moved or copied during this process. The faster your computer, the faster it runs. Generally speaking the analysis takes about 7 minutes per gigabyte (roughly 8.5 gigabytes per hour).
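
If you want to turn that rule of thumb into a quick planning number, the arithmetic can be sketched as below. The 7 minutes per gigabyte figure comes from the answer above; the function names are made up for this example, and actual speed will depend on your machine and data.

```python
MINUTES_PER_GB = 7  # rule of thumb quoted above; real throughput varies by machine


def estimated_minutes(gigabytes, minutes_per_gb=MINUTES_PER_GB):
    """Estimated analysis time in minutes for a given amount of data."""
    return gigabytes * minutes_per_gb


def gigabytes_per_hour(minutes_per_gb=MINUTES_PER_GB):
    """Throughput implied by a minutes-per-gigabyte rate."""
    return 60 / minutes_per_gb
```

For example, `estimated_minutes(40)` gives 280 minutes (a little under 5 hours) for a 40 gigabyte collection, and 60 / 7 works out to a little over 8.5 gigabytes per hour.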

Q:           How does earlyCASE handle foreign languages?

A:            earlyCASE uses Unicode for all data and can handle all languages, including Asian languages. To view and print some of these characters you will need to install the language packs. Even if you have not installed these languages, the information is correctly extracted from the metadata and stored in the database.

 

Q:           Is any confidential or privileged information on any of the reports?

A:            No, the reports are all summary in nature and none of the details of any of the documents or emails are on the reports. What you do with the reports is totally up to you; none of the document details or the reports are stored on our servers.

Q:           What do you store on your servers?

A:            We keep track of when you logon, the basic matter information and some usage stats on what was analyzed to help us track how people are using earlyCASE, the speed, etc.   Again, NONE of the document details or the reports are written to or stored on our servers.  

Q:           Can you run it from other countries or where privacy is an issue?

A:            Yes, by design earlyCASE can be run from anywhere in the world that Internet access is available. Because NONE of the documents are transmitted anywhere and the application is very compact (installs quickly), even a slow Internet connection is OK.

Q:           What is the difference between earlyCASE analysis and Electronic Discovery (E-Discovery)?

A:            E-Discovery generally adds several additional steps beyond what earlyCASE does. Full text extraction, password cracking (protection removal), and converting native documents to TIFF or PDF images are often considered part of the E-Discovery process. earlyCASE focuses on organizing the data in a form that lets you establish clear boundaries of what should be filtered or culled out before the data is subjected to E-Discovery processing and attorney review.

Q:           What types of HASH values are generated?

A:            The professional analysis gives you the option of generating an MD5 hash or a SHA1 hash for use in detecting duplicate documents. The MD5 (128 bit) method is more commonly used and takes less time to generate than the SHA1 hash (160 bit).
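
For readers who want to see what generating these values looks like in practice, here is a minimal sketch using Python's standard `hashlib` module. It is an illustration only, not earlyCASE's own code; the function name `file_hashes` and the 1 MB chunk size are assumptions for the example.

```python
import hashlib


def file_hashes(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for a file, reading it in chunks.

    Reading in chunks keeps memory use flat even for very large files.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```

The MD5 value comes back as 32 hexadecimal characters (128 bits) and the SHA1 value as 40 hexadecimal characters (160 bits). Two files with identical content produce identical values regardless of file name or location.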

Q:           Can I access the Metadata and document details?

A:            Yes,  the metadata and document details are stored in a Microsoft Access database that is created on the machine that you ran earlyCASE from.    You will need to have Microsoft Access 2003 or later installed to open and access this information.    Internally, the database has reports and forms already created for you to be able to immediately access and understand the database.   You can create your own reports if you need to, as well as export data from this database to Excel or other applications.

 

 

Q:           If I run the Basic Analysis, can I upgrade to Professional Analysis after the fact?

A:            No, some of the Professional analysis processes are not run when you elect the Basic Analysis.   To switch to the Professional analysis you will need to re-run the data electing the Professional analysis at run time.

Q:           How are Microsoft PSTs handled?

A:            earlyCASE has the ability to read into a PST without disturbing it, extracting metadata on messages as well as attachments to email messages. The Basic Analysis fully analyzes Microsoft PST contents. If you elect to run the Professional Analysis, you also get to see and understand how many duplicate email messages and attachments there are.

Q:           Are Lotus Domino / Notes NSFs handled?

A:            NSF files are shown on the file type summary as well as the container reports.  Presently earlyCASE does not extract the metadata and attachment information for the contents of Lotus Notes / Domino NSF files.   Handling NSF is currently in development   and when released will be included in the Professional analysis.    

Q:           How are Zip files handled?

A:            Zip files along with 17 other compressed types of containers are handled when the Professional Analysis is run. This process expands the compressed containers and extracts the metadata for the documents contained in the container. If the container contains other compressed containers, the process recurses through them until all of the containers have been expanded and analyzed. Keep in mind that container files are only processed when you elect the Professional Analysis.
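
The idea of recursing through containers inside containers can be sketched with Python's standard `zipfile` module. This is a simplified illustration under our own assumptions, not the earlyCASE implementation: it handles Zip only, works on archives held in memory, and the function name `list_zip_members` is hypothetical.

```python
import io
import zipfile


def list_zip_members(data, prefix=""):
    """Yield (path, size) for every file in a zip archive given as bytes.

    Members that are themselves .zip archives are opened and walked in turn,
    so nested containers are expanded until only ordinary files remain.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            content = archive.read(info)
            # Only recurse into members named *.zip that really are zip files.
            # (Formats such as .docx are also zip packages; this sketch treats
            # them as ordinary documents.)
            if path.lower().endswith(".zip") and zipfile.is_zipfile(io.BytesIO(content)):
                yield from list_zip_members(content, prefix=path + "/")
            else:
                yield path, info.file_size
```

Each yielded path shows where the file sat inside the nesting, for example `nested.zip/memo.txt`, which is the kind of detail a container report needs. A production tool would also need to guard against password protected, corrupt, or extremely deep archives.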

Q:           What is the 26(f) report?

A:            The 26(f) report is a consolidation of the key summary reports and budget information that is useful in reviewing the nature of the ESI you have and negotiating clearer lines for filtering and culling the data. The 26(f) report does NOT include any information that would be considered privileged or confidential.


Getting a Handle on all of those duplicate emails and documents

Author: admin - Categories: Case Assessment, MetaData, eDiscovery

The technology surrounding the detection of duplicate documents, emails, attachments, etc. is very mature at this point and can be relied upon to do a very good job of isolating duplicates. The two key areas are “Exact” duplicates and “Near” duplicates.

EXACT duplicates:

Detecting an “Exact” duplicate involves generating a hash or message digest from the source document, email, etc. and storing this calculated number in a database. Every document has a hash calculated for it, and that hash is then compared to the other hash values in the database. If the hash exists more than once, it’s an exact duplicate.

earlyCASE uses the MD5 hash algorithm to identify duplicates. At 128 bits in length, it can be calculated quickly and provides reliable uniqueness on which to base the duplicate detection process. When running in Professional Analysis mode you may elect to calculate and store SHA-1 hashes (160 bits) and use these as the basis for duplicate detection.

Hash algorithms using more bits (SHA-224, SHA-256, SHA-384, SHA-512) are generally used for cryptographic security purposes and are overkill for duplicate identification.

Handling duplicate documents, emails, etc. in a manner that ensures you only process and review each document one time will save you substantial time and money as you move through the process of working with electronically stored information (ESI). GOOD PRACTICE:  Use the duplicates reports to reach agreement among everyone involved to remove duplicates PRIOR to processing and review.
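
The exact-duplicate check described above boils down to grouping documents by their hash and keeping the groups with more than one member. Here is a minimal Python sketch of that idea. It uses an in-memory dictionary in place of a database, and the function name `find_exact_duplicates` is an assumption for the example rather than anything taken from earlyCASE.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(documents):
    """Group documents by MD5 digest and return only the duplicated groups.

    documents: dict mapping a document name to its content as bytes.
    Returns a dict mapping digest -> list of names sharing that digest.
    """
    by_hash = defaultdict(list)
    for name, content in documents.items():
        by_hash[hashlib.md5(content).hexdigest()].append(name)
    # A digest seen more than once marks a set of exact duplicates.
    return {digest: names for digest, names in by_hash.items() if len(names) > 1}
```

With three documents where two have identical content, the result contains a single digest listing those two names. Sorting the groups by size, largest first, would give you the "most duplicated" view that is handy for spot checks.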

 

NEAR Duplicates:

This is where the advanced math comes in: a bunch of PhD mathematicians have figured out how to identify “Similar” documents.   In layman’s terms, the document is systematically decomposed into overlapping groups of words, a hash value is generated and stored for each group, and the resulting sequence of hashes (known as Shingles) forms a multipart fingerprint of the document.  Exact duplicate detection depends on a single Hash value to determine whether documents are exactly the same.     Near duplicate detection looks at how many Shingle Hashes from this document exactly match the Shingle Hashes stored in the database for ALL of the other documents.

Once you know how many matches you have within a document, you also know how much of the document DOES NOT match anything you have in the collection.    By setting a “Resemblance” threshold you can control how conservatively or aggressively the Near Duplicate Detection engine classifies a document (or group of documents) as a duplicate.   If you are interested in more of the science behind this, have a look at some of the published articles at Princeton.edu.     This link is a good basic primer:

http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf
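
To make the shingle idea more concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the earlyCASE engine: the shingle size of 4 words, the use of MD5 for each shingle, and the function names are all choices made for the example. Resemblance is computed here as shared shingle hashes divided by total distinct shingle hashes (the Jaccard ratio), which is one common definition.

```python
import hashlib


def shingle_hashes(text: str, k: int = 4) -> set:
    """Hash every run of k consecutive words (a "shingle") in the text."""
    words = text.lower().split()
    if not words:
        return set()
    # A document shorter than k words yields a single shingle.
    count = max(len(words) - k + 1, 1)
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(count)
    }


def resemblance(a: set, b: set) -> float:
    """Shared shingle hashes divided by all distinct shingle hashes."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two sentences that differ only in the last word (illustrative data).
s1 = shingle_hashes("the quick brown fox jumps over the lazy dog")
s2 = shingle_hashes("the quick brown fox jumps over the lazy cat")
score = resemblance(s1, s2)
print(f"{len(s1 & s2)} of {len(s1 | s2)} shingles match: {score:.2f}")
```

Each sentence produces 6 shingles; 5 are shared and 7 are distinct overall, so the resemblance is 5/7, or about 0.71. An exact-duplicate hash would simply report these two sentences as different.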

So how reliable are these approaches:

Both approaches are very reliable.   EXACT duplicates are low-hanging savings, and it should be the NORM to remove or isolate these from the collection before you do any processing or review.    Near duplicates are a tougher sell if you are negotiating with another party to treat near duplicates as exact duplicates.    The primary reason is the concept of False Positives and False Negatives: in essence, people are fearful that you will treat something as a duplicate that is NOT in fact a duplicate, and that they will not get a document that should have been treated as responsive and produced.   Expect resistance to eliminating documents based on “Near” duplicate algorithms for a while.

So how do you leverage “Near” duplicate detection:

Using a reasonableness test is a good starting place for this.     FIRST – remove the EXACT duplicates to get them out of the way.   SECOND – tag, group, or cluster (whatever you want to call it) the “Near” duplicates so you can deal with possible duplicates together as one.    NEXT – start with an aggressive approach (a lower % resemblance that causes a document to be suspected as a duplicate).   Look at the documents that resemble other documents; if you think some are NOT duplicates, use a higher % resemblance (i.e. more conservative) setting and repeat the process of looking at the groupings.   When you have reached a resemblance setting that you feel groups just the documents that really are duplicates, you are set to use it to quickly look through the near duplicate groups.    In some cases, once you are comfortable with the resemblance setting you have arrived at, you can treat “Near” duplicates with this degree of resemblance or higher as if they were Exact duplicates for the purpose of reducing the population of documents to be processed and reviewed.
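
The "start aggressive, then tighten" process above can be sketched in Python as follows. This is a hedged illustration only: the 3-word shingles, the Jaccard-style resemblance score, the transitive grouping (if A resembles B and B resembles C, all three land in one group), and every name in the code are assumptions for the example. For brevity the shingles are kept as word tuples rather than hashed.

```python
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles (word tuples here; a real system stores hashes)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def resemblance(a, b):
    """Shared shingles divided by all distinct shingles."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def near_duplicate_groups(docs, threshold):
    """Group documents whose pairwise resemblance meets the threshold."""
    parent = {name: name for name in docs}

    def find(x):
        # Follow parent links to the group's representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = {name: shingles(text) for name, text in docs.items()}
    for a, b in combinations(docs, 2):
        if resemblance(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in docs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Illustrative documents only.
docs = {
    "v1": "please review the attached contract before friday",
    "v2": "please review the attached contract before monday",
    "v3": "please review the attached invoice before monday",
    "other": "lunch is at noon in the cafeteria",
}
for threshold in (0.2, 0.6, 0.9):
    print(threshold, near_duplicate_groups(docs, threshold))
```

With this sample data, the aggressive 0.2 setting groups v1, v2 and v3 together, 0.6 keeps only v1 and v2 together, and 0.9 groups nothing. Reviewing the groups at each setting is the manual step that tells you which threshold to settle on.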

GOOD PRACTICE:  Document your decision process and tested documents / groups in case you are called upon to explain why you did what you did.  

GOOD PRACTICE:  Reach agreement on “Near” duplicates ahead of time, before you eliminate any of them from the document population.   Some teams use “Near” duplicate tagging in the document review process and NOT as a means of reducing the overall population.

 

Processing Power:

Because “Near” duplicate detection involves generating far more hashes (one per shingle rather than one per document), it will take much longer to process and will create a larger database to house the hashes.     Even so, using a computer to help identify documents which are POSSIBLY duplicates is a very good idea.   Regardless of how you leverage this technology, it will save you time and money.


eDiscovery early case assessment

Author: admin - Categories: Case Assessment, MetaData, eDiscovery - Tags: , , , ,

earlyCASE is a web-based application which runs on your local PC and analyzes the ESI that your computer can access without the data ever leaving your computer or network. earlyCASE allows you to see and understand all of your data before it is processed for discovery. It supports multiple languages, extracts metadata, generates hash values, detects duplicates and creates a local inventory database of documents and emails. earlyCASE allows users to make informed discovery decisions and easily cut down the size of data sets through filter and culling rules before going into the discovery process and review.
