Appraisal documents contain valuable information for analytics and decision-making at various steps of the mortgage process. Extracting and standardizing the data embedded in these documents is the first step, and it requires automation to avoid manual data entry. PropMix’s digitization solution uses image processing, OCR, and deep learning to convert many common appraisal forms into MISMO XML. We can process PDFs containing parseable content (first-generation PDFs) or scanned images (second- or higher-generation PDFs).
Built on decades of experience in image processing and Artificial Intelligence, our intelligent OCR (Optical Character Recognition) process can extract data from any document.
Appraisal Digitization Challenges and Solutions
Processing both first- and second-generation appraisal documents raises certain interesting challenges:
In addition to the above, second- or later-generation PDFs pose more complex issues because the text in these documents is not parseable; we only have an image for each page of the form. We rely on our OCR engine to extract the text from such pages and then process the text through a combination of heuristics, statistical models, and machine learning techniques to determine fields and field values. For example, the OCR engine might extract an adjustment value as “^ $ 6000 |”. Because our processing has mapped the field to an adjustment field, we expect to see an amount and can deduce that the value must be “$6,000”. Similar rules apply to most fields, including certain higher-level data checks on, for example, Census Tract IDs, Flood Zone indicators, and dates.
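The adjustment-value example above can be sketched in a few lines of Python. This is only an illustration of field-aware cleanup, not PropMix's actual implementation: the function name `normalize_amount` is hypothetical, and the sketch assumes whole-dollar amounts and treats any hyphen in the raw text as a minus sign.

```python
import re
from typing import Optional


def normalize_amount(raw: str) -> Optional[str]:
    """Clean noisy OCR text for a field already known to hold a dollar amount.

    Hypothetical helper: assumes whole-dollar values and that a hyphen
    anywhere in the raw text marks a negative adjustment.
    """
    # Take the first run of digits (commas allowed) and ignore stray symbols.
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return None  # nothing that looks like an amount was recognized
    amount = int(match.group().replace(",", ""))
    sign = "-" if "-" in raw else ""
    # Re-emit the value in a canonical "$6,000" style.
    return f"{sign}${amount:,}"


print(normalize_amount("^ $ 6000 |"))
```

Because the field type is known in advance, the cleanup can discard characters such as “^” and “|” that could not be part of a valid amount.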
Digitization at Scale
With all of the complexity explained above, extracting reliable data from an appraisal form is a computationally intensive process. Thanks to our fully scalable cloud-based platform hosted on AWS, we can easily scale to handle high volumes. The system automatically adds more servers to our compute clusters in response to increasing volume so that we can maintain response times within our committed SLA limits.
We can process most common documents within 5 minutes. Processing time can be slightly higher for large (over 30MB) second- or later-generation PDF documents or documents containing more than 40 pages.
In addition to handling high volumes, the digitization solution is also designed from the ground up to scale functionally to handle new types of appraisal documents. The system currently supports the following:
All the data extracted from any form is standardized into a common appraisal data model, which is reused for all property types – SFR, Condo, 2-4 Unit Multi-Family, Manufactured Homes, etc. This allows us to easily generate any target data format from the standardized data model. The out-of-the-box output format is MISMO 2.6 GSE, but we can also generate any other custom format as required.
Digitization Quality Control
There are primarily two challenges in ensuring the quality of the data produced from the appraisal documents:
Our unique combination of image processing, OCR, and deep learning helps us handle a wide range of document qualities.
We check for consistency within the extracted data using a set of rules. For example, the adjustment numbers need to be mathematically consistent, subject property data needs to be consistent between the site/improvement sections and the comparable grid, and dates need to be consistent with each other (for example, the effective date of the appraisal vs. the signature date).
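As a rough illustration of rule-based consistency checking, the Python sketch below tests two of the rule types mentioned above. The `Comparable` record, its field names, and the specific rules (line adjustments summing to the net adjustment, and a signature date not earlier than the effective date) are assumptions made for this example and are not taken from the PropMix rule set.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Comparable:
    """Hypothetical extracted values for one comparable sale."""
    sale_price: int
    adjustments: List[int]
    net_adjustment: int
    adjusted_price: int


def check_comparable(comp: Comparable) -> List[str]:
    """Return a list of arithmetic inconsistencies (empty if none)."""
    issues = []
    if sum(comp.adjustments) != comp.net_adjustment:
        issues.append("net adjustment differs from the sum of line adjustments")
    if comp.sale_price + comp.net_adjustment != comp.adjusted_price:
        issues.append("adjusted price differs from sale price plus net adjustment")
    return issues


def check_dates(effective_date: date, signature_date: date) -> List[str]:
    """Assumed rule: the report is not signed before its effective date."""
    if signature_date < effective_date:
        return ["signature date precedes effective date"]
    return []


comp = Comparable(300000, [5000, -2000], 3000, 303000)
print(check_comparable(comp) + check_dates(date(2023, 5, 1), date(2023, 5, 3)))
```

Returning a list of issues rather than a single pass/fail flag lets a later step count or weight the failures.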
Each appraisal is assigned a data quality score after the extraction is completed. If the document does not achieve the target data quality score, it is disqualified for delivery and we report an error to the client instead. This discipline of quality control has helped improve the reliability of the PropMix digitization solution.
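A quality gate of this kind could look roughly like the Python sketch below. The scoring formula (the fraction of checks passed), the 0.9 target, and the names `quality_score` and `deliver_or_reject` are all assumptions chosen for illustration; the article does not describe how the actual score is computed.

```python
from typing import Dict


def quality_score(checks: Dict[str, bool]) -> float:
    """Assumed scoring: the fraction of checks that passed."""
    if not checks:
        return 0.0
    return sum(checks.values()) / len(checks)


def deliver_or_reject(checks: Dict[str, bool], target: float = 0.9) -> dict:
    """Deliver the result only if the score reaches the (hypothetical) target."""
    score = quality_score(checks)
    if score >= target:
        return {"status": "delivered", "score": score}
    # Below target: report an error along with the names of the failed checks.
    failed = [name for name, passed in checks.items() if not passed]
    return {"status": "error", "score": score, "failed": failed}


result = deliver_or_reject({
    "adjustments_sum": True,
    "subject_matches_grid": True,
    "dates_consistent": False,
    "census_tract_format": True,
})
print(result["status"], result["score"])
```

Listing the failed checks in the error response is one way to make a rejection actionable for the client.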
Try it now https://propmix.io/appraisal-digitization